In our final group assignment we will analyse data about Airbnb listings and fit a model to predict the total cost for two people staying 4 nights in an Airbnb in a city.

#Using the skim function, we can get a structured overview of the data and identify some key features. First, neither price nor property_type has any missing values. The means of beds and bedrooms are both just under 2 (1.91 and 1.66), while no statistics are available for bathrooms because all of its values are missing. The mean of minimum_nights is about 6.6 (with a median of 2), and listings accommodate about 3 people on average.
skim(listings)
Data summary
Name listings
Number of rows 31030
Number of columns 74
_______________________
Column type frequency:
character 24
Date 5
logical 8
numeric 37
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 34 37 0 31030 0
name 11 1.00 1 250 0 30309 0
description 1179 0.96 1 1000 0 29256 0
neighborhood_overview 12691 0.59 1 1000 0 16565 0
picture_url 0 1.00 61 126 0 30547 0
host_url 0 1.00 39 43 0 23467 0
host_name 13 1.00 1 35 0 7400 0
host_location 45 1.00 2 255 0 1752 0
host_about 15101 0.51 1 9009 0 10796 22
host_response_time 13 1.00 3 18 0 5 0
host_response_rate 13 1.00 2 4 0 41 0
host_acceptance_rate 13 1.00 2 4 0 95 0
host_thumbnail_url 13 1.00 55 106 0 23334 0
host_picture_url 13 1.00 57 109 0 23334 0
host_neighbourhood 12457 0.60 4 30 0 233 0
host_verifications 0 1.00 2 161 0 443 0
neighbourhood 12690 0.59 9 60 0 699 0
neighbourhood_cleansed 0 1.00 4 16 0 38 0
property_type 0 1.00 3 35 0 94 0
room_type 0 1.00 10 15 0 4 0
bathrooms_text 34 1.00 6 17 0 34 0
amenities 0 1.00 2 1520 0 28475 0
price 0 1.00 5 10 0 1002 0
license 29649 0.04 3 20 0 833 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2021-09-08 2021-09-09 2021-09-09 2
host_since 13 1.00 2009-03-20 2021-09-02 2015-12-30 3578
calendar_last_scraped 0 1.00 2021-09-08 2021-09-09 2021-09-09 2
first_review 9148 0.71 2011-03-09 2021-09-07 2018-09-21 2703
last_review 9148 0.71 2011-11-16 2021-09-09 2019-07-29 2394

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 13 1 0.12 FAL: 27336, TRU: 3681
host_has_profile_pic 13 1 1.00 TRU: 30876, FAL: 141
host_identity_verified 13 1 0.75 TRU: 23184, FAL: 7833
neighbourhood_group_cleansed 31030 0 NaN :
bathrooms 31030 0 NaN :
calendar_updated 31030 0 NaN :
has_availability 0 1 0.99 TRU: 30640, FAL: 390
instant_bookable 0 1 0.36 FAL: 19714, TRU: 11316

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.577473e+07 13670837.08 1.115600e+04 1.500225e+07 2.412513e+07 3.826905e+07 5.211617e+07 ▆▆▇▆▅
scrape_id 0 1.00 2.021091e+13 0.00 2.021091e+13 2.021091e+13 2.021091e+13 2.021091e+13 2.021091e+13 ▁▁▇▁▁
host_id 0 1.00 9.613615e+07 99177732.26 1.085700e+04 1.982952e+07 5.237223e+07 1.524548e+08 4.212017e+08 ▇▂▁▁▁
host_listings_count 13 1.00 1.508000e+01 158.48 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 3.508000e+03 ▇▁▁▁▁
host_total_listings_count 13 1.00 1.508000e+01 158.48 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 3.508000e+03 ▇▁▁▁▁
latitude 0 1.00 -3.386000e+01 0.07 -3.414000e+01 -3.390000e+01 -3.388000e+01 -3.383000e+01 -3.340000e+01 ▁▇▃▁▁
longitude 0 1.00 1.512000e+02 0.09 1.506000e+02 1.511800e+02 1.512200e+02 1.512600e+02 1.513400e+02 ▁▁▁▃▇
accommodates 0 1.00 3.240000e+00 2.12 1.000000e+00 2.000000e+00 2.000000e+00 4.000000e+00 1.600000e+01 ▇▁▁▁▁
bedrooms 1998 0.94 1.660000e+00 1.05 1.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 4.600000e+01 ▇▁▁▁▁
beds 395 0.99 1.910000e+00 1.48 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 2.700000e+01 ▇▁▁▁▁
minimum_nights 0 1.00 6.560000e+00 31.66 1.000000e+00 1.000000e+00 2.000000e+00 5.000000e+00 1.125000e+03 ▇▁▁▁▁
maximum_nights 0 1.00 6.569800e+02 528.24 1.000000e+00 3.000000e+01 1.125000e+03 1.125000e+03 1.500000e+03 ▆▁▁▇▁
minimum_minimum_nights 0 1.00 6.400000e+00 31.04 1.000000e+00 1.000000e+00 2.000000e+00 5.000000e+00 1.125000e+03 ▇▁▁▁▁
maximum_minimum_nights 0 1.00 6.980000e+00 31.48 1.000000e+00 2.000000e+00 3.000000e+00 5.000000e+00 1.125000e+03 ▇▁▁▁▁
minimum_maximum_nights 0 1.00 7.620090e+05 40426410.17 1.000000e+00 3.200000e+01 1.125000e+03 1.125000e+03 2.147484e+09 ▇▁▁▁▁
maximum_maximum_nights 0 1.00 1.869321e+06 63319675.46 1.000000e+00 3.500000e+01 1.125000e+03 1.125000e+03 2.147484e+09 ▇▁▁▁▁
minimum_nights_avg_ntm 0 1.00 6.680000e+00 31.22 1.000000e+00 1.300000e+00 3.000000e+00 5.000000e+00 1.125000e+03 ▇▁▁▁▁
maximum_nights_avg_ntm 0 1.00 1.866813e+06 63234821.62 1.000000e+00 3.500000e+01 1.125000e+03 1.125000e+03 2.147484e+09 ▇▁▁▁▁
availability_30 0 1.00 8.350000e+00 12.52 0.000000e+00 0.000000e+00 0.000000e+00 2.300000e+01 3.000000e+01 ▇▁▁▁▃
availability_60 0 1.00 1.794000e+01 25.68 0.000000e+00 0.000000e+00 0.000000e+00 5.300000e+01 6.000000e+01 ▇▁▁▁▃
availability_90 0 1.00 2.791000e+01 38.94 0.000000e+00 0.000000e+00 0.000000e+00 8.000000e+01 9.000000e+01 ▇▁▁▁▃
availability_365 0 1.00 8.871000e+01 130.94 0.000000e+00 0.000000e+00 0.000000e+00 1.700000e+02 3.650000e+02 ▇▁▁▁▂
number_of_reviews 0 1.00 1.479000e+01 38.25 0.000000e+00 0.000000e+00 2.000000e+00 1.000000e+01 8.360000e+02 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 2.250000e+00 8.84 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 6.690000e+02 ▇▁▁▁▁
number_of_reviews_l30d 0 1.00 4.000000e-02 0.31 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.400000e+01 ▇▁▁▁▁
review_scores_rating 9148 0.71 4.410000e+00 1.17 0.000000e+00 4.500000e+00 4.820000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_accuracy 10324 0.67 4.740000e+00 0.53 0.000000e+00 4.700000e+00 4.930000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_cleanliness 10308 0.67 4.580000e+00 0.65 0.000000e+00 4.500000e+00 4.810000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_checkin 10336 0.67 4.830000e+00 0.45 0.000000e+00 4.850000e+00 5.000000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_communication 10314 0.67 4.830000e+00 0.47 0.000000e+00 4.860000e+00 5.000000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_location 10335 0.67 4.820000e+00 0.40 0.000000e+00 4.800000e+00 4.980000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_value 10343 0.67 4.640000e+00 0.55 0.000000e+00 4.500000e+00 4.800000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
calculated_host_listings_count 0 1.00 6.360000e+00 21.96 1.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 1.990000e+02 ▇▁▁▁▁
calculated_host_listings_count_entire_homes 0 1.00 5.120000e+00 21.20 0.000000e+00 0.000000e+00 1.000000e+00 1.000000e+00 1.990000e+02 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1.00 1.130000e+00 5.72 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 9.300000e+01 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1.00 6.000000e-02 0.59 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.700000e+01 ▇▁▁▁▁
reviews_per_month 9148 0.71 6.400000e-01 1.31 1.000000e-02 5.000000e-02 1.500000e-01 6.800000e-01 5.400000e+01 ▇▁▁▁▁
favstats(~number_of_reviews, data = listings)
min Q1 median Q3 max mean sd n missing
0 0 2 10 836 14.8 38.2 31030 0
# On average there are about 15 reviews per listing, but the distribution is strongly right-skewed: the median is only 2 and the maximum is 836.

favstats(~reviews_per_month, data = listings)
min Q1 median Q3 max mean sd n missing
0.01 0.05 0.15 0.68 54 0.64 1.31 21882 9148
# On average there are about 0.64 reviews per month (median 0.15), with values ranging from 0.01 to 54 per month.

favstats(~review_scores_rating, data = listings)
min Q1 median Q3 max mean sd n missing
0 4.5 4.82 5 5 4.41 1.17 21882 9148
# This statistic is crucial and very informative: ratings are concentrated near the top of the 0-5 scale (median 4.82, mean 4.41), and are available for 21,882 listings; the remaining 9,148 have no rating.

favstats(~bathrooms, data = listings)
min Q1 median Q3 max mean sd n missing
NA NA NA NA NA NaN NA 0 31030
#Note that no summary statistics are available for bathrooms because all 31,030 values are missing.
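
A quick way to spot columns like bathrooms before summarising them is to compute the share of missing values per column. The following is a minimal sketch in base R on a made-up data frame (`toy` and its values are assumptions standing in for `listings`):

```r
# Toy data frame standing in for `listings` (values are made up)
toy <- data.frame(
  price     = c(100, 150, 80, 200),
  bedrooms  = c(1, NA, 2, 1),
  bathrooms = c(NA, NA, NA, NA)   # entirely missing, like `bathrooms`
)

# Share of missing values in each column
missing_share <- colMeans(is.na(toy))
print(missing_share)

# Names of columns that contain no usable data at all
all_missing <- names(missing_share)[missing_share == 1]
print(all_missing)
```

Applied to the real data, the same two lines would presumably flag bathrooms, calendar_updated and neighbourhood_group_cleansed, which skim reports as completely missing.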

favstats(~review_scores_cleanliness, data = listings)
min Q1 median Q3 max mean sd n missing
0 4.5 4.81 5 5 4.58 0.653 20722 10308
# As with the overall rating, cleanliness scores are concentrated near the top of the 0-5 scale, with a median of 4.81 and a mean of 4.58 across 20,722 listings with a score.

favstats(~review_scores_communication, data = listings)
min Q1 median Q3 max mean sd n missing
0 4.86 5 5 5 4.83 0.471 20716 10314
# Communication scores are similarly high (median 5, mean 4.83) across 20,716 listings with a score.

favstats(~review_scores_checkin, data = listings)
min Q1 median Q3 max mean sd n missing
0 4.85 5 5 5 4.83 0.452 20694 10336
favstats(~review_scores_location, data = listings)
min Q1 median Q3 max mean sd n missing
0 4.8 4.98 5 5 4.82 0.399 20695 10335
favstats(~maximum_nights, data = listings)
min Q1 median Q3 max mean sd n missing
1 30 1.12e+03 1.12e+03 1.5e+03 657 528 31030 0
favstats(~minimum_nights, data = listings)
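min Q1 median Q3 max mean sd n missing
1 1 2 5 1.12e+03 6.56 31.7 31030 0
# Neither variable has missing data (31,030 observations each). The standard deviations are very high relative to the means, which reflects extreme values such as a minimum stay of about 1,125 nights.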
ggplot(data=listings, aes(x=review_scores_rating , y=review_scores_cleanliness  , group=1)) +
  geom_point()+
  ggtitle("Relationship between ratings and cleanliness scores") +
  xlab("Ratings") + ylab("Cleanliness")

#We plotted overall ratings against cleanliness scores to understand how perceived cleanliness relates to the overall rating score.
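
The scatterplot can be complemented with a correlation coefficient. The sketch below uses made-up score vectors (not values from `listings`) and shows how `cor()` can be told to ignore pairs where either score is missing, which matters here because roughly a third of listings have no review scores:

```r
# Made-up scores on the 0-5 scale; NA marks a listing without that score
rating      <- c(5.0, 4.8, 4.5, NA, 3.0, 4.9)
cleanliness <- c(5.0, 4.7, 4.2, 4.0, NA, 4.8)

# Number of listings where both scores are present
n_pairs <- sum(complete.cases(rating, cleanliness))

# Pearson correlation computed on complete pairs only
r <- cor(rating, cleanliness, use = "complete.obs")

print(n_pairs)
print(round(r, 3))
```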

ggplot(data=listings, aes(x=host_identity_verified )) + 
  geom_bar(color="black", fill="white")+
   ggtitle("Number of verified hosts per listing")+ 
  xlab("Verified hosts") + ylab("Number of Listings")

# We counted listings by host identity verification status to see how common verified hosts are; about 75% of listings with this information have a verified host.


ggplot(data=listings, aes (x= review_scores_location ))+
  geom_histogram(bins = 30)+ #a single histogram layer with 30 bins
   ggtitle("Location ratings") +
  xlab("Ratings") + ylab("Number of Listings")

# We also looked at the distribution of location review scores across listings in Sydney.
listings <- listings %>% 
  mutate(price = parse_number(as.character(price))) #converting price from char to double

typeof(listings$price)
[1] "double"
listings_num_variables <- listings %>% 
  select(where(is.numeric))

dropping <- c("host_total_listings_count", "price", "scrape_id", "host_id", "latitude", "longitude",
              "availability_30", "availability_90", "availability_365",
              "calculated_host_listings_count", "calculated_host_listings_count_entire_homes",
              "calculated_host_listings_count_private_rooms", "calculated_host_listings_count_shared_rooms",
              "minimum_minimum_nights", "maximum_minimum_nights", "minimum_maximum_nights",
              "maximum_maximum_nights", "minimum_nights_avg_ntm", "maximum_nights_avg_ntm")
# availability_60 is not dropped and stays in the reduced dataset

listings_num_variables <- listings_num_variables[, !(names(listings_num_variables) %in% dropping)]  #Removed duplicated or redundant variables

listings_num_variables <- listings_num_variables %>% 
  filter(maximum_nights< 360) 

listings_num_variables

library(corrplot)

correl = cor(listings_num_variables, use="pairwise.complete.obs")
corrplot(correl, method = "circle") #let's look at the correlation in a different form

dropping2 <- c("beds", "number_of_reviews_ltm", "number_of_reviews_l30d", "review_scores_accuracy",
               "review_scores_checkin", "review_scores_location", "review_scores_value", "host_listings_count")
# reviews_per_month is not dropped and stays in the reduced dataset

listings_num_variables <- listings_num_variables[, !(names(listings_num_variables) %in% dropping2)]  #Removed other less relevant variables
listings_num_variables

ggpairs(listings_num_variables) #Analyzing the correlations between the variables

skim(listings_num_variables)

From the initial dataset we removed variables that were redundant or of little use. We first summarised the data with skim and favstats to understand its initial state. We then reduced the dataset and examined the correlations shown in the plots. Most of the relationships do not appear to be linear.

  • How many variables/columns? How many rows/observations?

We started with 74 variables and ended with 11 after dropping columns. We started with 31030 observations and ended with 12314 after filtering the dataset.

  • Which variables are numbers?

At the beginning they were: id, scrape_id, host_id, host_listings_count, host_total_listings_count, latitude, longitude, accommodates, bedrooms, beds, minimum_nights, maximum_nights, minimum_minimum_nights, maximum_minimum_nights, minimum_maximum_nights, maximum_maximum_nights, minimum_nights_avg_ntm, maximum_nights_avg_ntm, availability_30, availability_60, availability_90, availability_365, number_of_reviews, number_of_reviews_ltm, number_of_reviews_l30d, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, calculated_host_listings_count, calculated_host_listings_count_entire_homes, calculated_host_listings_count_private_rooms, calculated_host_listings_count_shared_rooms, reviews_per_month.

After modifying the dataset: id, accommodates, bedrooms, minimum_nights, maximum_nights, availability_60, number_of_reviews, review_scores_rating, review_scores_cleanliness, review_scores_communication, reviews_per_month.

  • Which are categorical or factor variables (numeric or character variables that have a fixed and known set of possible values)?

listing_url, name, description, neighborhood_overview, picture_url, host_url, host_name, host_location, host_about, host_response_time, host_response_rate, host_acceptance_rate, host_thumbnail_url, host_picture_url, host_neighbourhood, host_verifications, neighbourhood, neighbourhood_cleansed, property_type, room_type, bathrooms_text, amenities, price, license.

  • What are the correlations between variables? Does each scatterplot support a linear relationship between variables? Do any of the correlations appear to be conditional on the value of a categorical variable?

The correlations suggest that most variables are not strongly linearly related. Outside groups of closely related variables (the different types of review scores, for instance), the correlation coefficients are quite low, and the scatterplots do not show clear linear patterns.
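
A low Pearson coefficient does not by itself rule out a strong but non-linear relationship. As a hedged illustration on synthetic data (not taken from `listings`), the sketch below compares Pearson and Spearman correlations for a relationship that is perfectly monotonic but far from linear:

```r
# Synthetic example: y grows exponentially in x, so the relationship
# is perfectly monotonic but clearly not linear
x <- 1:20
y <- exp(x / 2)

pearson  <- cor(x, y, method = "pearson")   # measures linear association
spearman <- cor(x, y, method = "spearman")  # measures monotonic (rank) association

print(round(pearson, 2))
print(spearman)
```

Passing `method = "spearman"` to the `cor()` call used for the correlation plot would be one way to check whether the low coefficients reflect weak relationships or merely non-linear ones.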

0.1 Data wrangling

listings <- listings %>% 
  mutate(price = parse_number(as.character(price))) #converting price from character to double

typeof(listings$price)
[1] "double"

We used typeof(listings$price) to confirm that price is now stored as a number.
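
To make the conversion concrete, here is a minimal base R sketch of the same idea. It assumes that prices are stored as strings with a currency symbol and thousands separators (for example "$1,250.00"); the example strings are made up, and `gsub()` is used here only as a dependency-free stand-in for parse_number:

```r
# Made-up price strings in the assumed "$1,250.00" format
price_raw <- c("$120.00", "$1,250.00", "$75.00")

# Strip the dollar sign and thousands separators, then convert to numeric
price_num <- as.numeric(gsub("[$,]", "", price_raw))

print(price_num)
print(typeof(price_num))
```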

0.2 Property types

Next, we look at the variable property_type. We can use the count function to determine how many categories there are and their frequency. What are the top 4 most common property types? What proportion of the total listings do they make up?

listings_category <- listings %>% #rank the property types by number of listings
  group_by(property_type) %>%
  summarise(number_of_category = n()) %>% #number of listings of each property type
  arrange(desc(number_of_category)) %>% #most common first
  mutate(all_category = sum(number_of_category), category_proportion = number_of_category / all_category) %>% #each property type as a proportion of all listings
  head(4) #keep the top 4 most common property types

listings_category
property_type                     number_of_category  all_category  category_proportion
Entire rental unit                11678               31030         0.376
Private room in rental unit       5870                31030         0.189
Entire residential home           4499                31030         0.145
Private room in residential home  3740                31030         0.121
listings_category_top_4 <- listings_category %>% #what proportion of all properties are made up by our top 4?
  summarise(top_four_category=sum(category_proportion))

listings_category_top_4
top_four_category
0.831
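
As a cross-check of the counting logic, the same quantities can be obtained in base R with `table()` and `prop.table()`. The sketch below runs on a made-up vector of property types (the values and counts are assumptions, not the Sydney data):

```r
# Made-up property types: 5 + 3 + 2 = 10 listings
property_type <- c(rep("Entire rental unit", 5),
                   rep("Private room in rental unit", 3),
                   rep("Boat", 2))

# Frequency of each type, most common first
counts <- sort(table(property_type), decreasing = TRUE)

# Each type as a proportion of all listings
props <- prop.table(counts)

# Combined share of the two most common types
top_two_share <- sum(props[1:2])

print(counts)
print(top_two_share)
```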

Since the vast majority of the observations in the data are one of the top four or five property types, we would like to create a simplified version of property_type variable that has 5 categories: the top four categories and Other. Fill in the code below to create prop_type_simplified.

Use the code below to check that prop_type_simplified was correctly made.

listings <- listings %>%
  mutate(prop_type_simplified = case_when(
    property_type %in% c("Entire rental unit","Private room in rental unit", "Entire residential home","Private room in residential home") ~ property_type, 
    TRUE ~ "Other"
  ))
listings %>% #the above new column displayed and compared with old classification
  count(property_type, prop_type_simplified) %>%
  arrange(desc(n))    
property_type                        prop_type_simplified              n
Entire rental unit                   Entire rental unit                11678
Private room in rental unit          Private room in rental unit       5870
Entire residential home              Entire residential home           4499
Private room in residential home     Private room in residential home  3740
Private room in townhouse            Other                             666
Entire townhouse                     Other                             535
Entire guest suite                   Other                             523
Entire guesthouse                    Other                             382
Entire condominium (condo)           Other                             335
Shared room in rental unit           Other                             332
Room in boutique hotel               Other                             251
Private room in condominium (condo)  Other                             239
Private room in villa                Other                             147
Entire serviced apartment            Other                             135
Private room in guest suite          Other                             130
Room in hotel                        Other                             126
Entire loft                          Other                             124
Entire cottage                       Other                             122
Private room in guesthouse           Other                             119
Entire villa                         Other                             118
Shared room in residential home      Other                             112
Entire bungalow                      Other                             87
Private room in bed and breakfast    Other                             75
Private room in hostel               Other                             74
Shared room in hostel                Other                             51
Private room in bungalow             Other                             46
Room in aparthotel                   Other                             43
Room in serviced apartment           Other                             39
Private room in loft                 Other                             34
Entire cabin                         Other                             33
Private room in serviced apartment   Other                             33
Entire place                         Other                             32
Tiny house                           Other                             29
Private room                         Other                             22
Shared room in condominium (condo)   Other                             20
Boat                                 Other                             17
Private room in cabin                Other                             15
Camper/RV                            Other                             12
Room in hostel                       Other                             12
Shared room in guesthouse            Other                             12
Shared room in townhouse             Other                             12
Shared room in villa                 Other                             12
Farm stay                            Other                             11
Private room in cottage              Other                             10
Room in bed and breakfast            Other                             9
Private room in tiny house           Other                             8
Shared room in bed and breakfast     Other                             8
Private room in casa particular      Other                             5
Tent                                 Other                             5
Earth house                          Other                             4
Entire chalet                        Other                             4
Private room in boat                 Other                             4
Private room in earth house          Other                             4
Shared room in guest suite           Other                             4
Barn                                 Other                             3
Entire home/apt                      Other                             3
Floor                                Other                             3
Island                               Other                             3
Private room in camper/rv            Other                             3
Private room in farm stay            Other                             3
Private room in tent                 Other                             3
Casa particular                      Other                             2
Holiday park                         Other                             2
Private room in barn                 Other                             2
Private room in bus                  Other                             2
Private room in chalet               Other                             2
Shared room in loft                  Other                             2
Shared room in serviced apartment    Other                             2
Bus                                  Other                             1
Campsite                             Other                             1
Castle                               Other                             1
Cave                                 Other                             1
Dome house                           Other                             1
Private room in in-law               Other                             1
Private room in island               Other                             1
Private room in minsu                Other                             1
Private room in nature lodge         Other                             1
Private room in pension              Other                             1
Private room in resort               Other                             1
Private room in tipi                 Other                             1
Private room in train                Other                             1
Private room in yurt                 Other                             1
Room in resort                       Other                             1
Shared room in boat                  Other                             1
Shared room in boutique hotel        Other                             1
Shared room in cave                  Other                             1
Shared room in cottage               Other                             1
Shared room in earth house           Other                             1
Shared room in farm stay             Other                             1
Shared room in tent                  Other                             1
Shared room in tiny house            Other                             1
Train                                Other                             1
Treehouse                            Other                             1
Yurt                                 Other                             1
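
The four category names were typed by hand above. An alternative, sketched here in base R on made-up data (the single-letter types are placeholders), is to derive the top four from the data itself so that the recoding stays correct if the ranking changes:

```r
# Made-up property types with distinct frequencies: A=6, B=5, C=4, D=3, E=1, F=1
property_type <- c(rep("A", 6), rep("B", 5), rep("C", 4), rep("D", 3), "E", "F")

# Names of the four most frequent types
top4 <- names(sort(table(property_type), decreasing = TRUE))[1:4]

# Keep the top four as they are and lump everything else into "Other"
prop_type_simplified <- ifelse(property_type %in% top4, property_type, "Other")

print(table(prop_type_simplified))
```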

Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels. We only want to include listings in our regression analysis that are intended for travel purposes:

  • What are the most common values for the variable minimum_nights?
listings %>% #by taking a count, we can figure out which values are most common
  count(minimum_nights) %>%
  arrange(desc(n))
minimum_nights n
1       8203
2       7320
3       4468
7       3127
5       2677
4       1774
14       781
10       475
6       439
30       265
21       202
90       163
28       113
15       106
20       96
8       95
31       81
12       72
60       63
9       52
13       39
365       38
180       34
25       27
100       23
16       15
19       14
11       13
29       13
50       13
18       12
35       11
40       11
120       11
24       10
17       9
23       9
45       8
91       8
360       8
42       7
70       7
27       6
300       6
1e+03       6
56       5
150       5
200       5
1.12e+03    5
22       4
55       3
58       3
80       3
500       3
1.1e+03 3
26       2
32       2
34       2
47       2
48       2
84       2
92       2
183       2
222       2
240       2
364       2
33       1
37       1
44       1
49       1
51       1
62       1
74       1
75       1
83       1
85       1
87       1
89       1
93       1
94       1
95       1
96       1
99       1
115       1
130       1
132       1
149       1
152       1
168       1
178       1
179       1
182       1
185       1
190       1
198       1
199       1
211       1
220       1
256       1
280       1
333       1
395       1
700       1
999       1
1.12e+03    1
  • Is there any value among the common values that stands out? What is the likely intended purpose for Airbnb listings with this seemingly unusual value for minimum_nights?

A 7-night minimum stay is more common than 4, 5 or 6 nights, which stands out because it is a longer period. This is plausible, since hosts are likely to prefer renting their properties for a week at a time rather than having bookings end at an arbitrary point midweek.
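
Restricting the data to listings intended for travel can then be done with a simple filter on minimum_nights. The sketch below uses a made-up data frame and a cut-off of four nights, which matches the four-night stay analysed in this assignment; both the toy values and the name `travel` are assumptions:

```r
# Toy listings with a range of minimum-stay requirements
toy <- data.frame(
  id             = 1:6,
  minimum_nights = c(1, 2, 7, 30, 4, 90)
)

# Keep only listings that can be booked for four nights or fewer
travel <- subset(toy, minimum_nights <= 4)

print(travel)
```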

1 Mapping

The following code, having downloaded a dataframe listings with all Airbnb listings in Sydney, will plot on the map all Airbnbs where minimum_nights is less than or equal to four (4).

leaflet(data = filter(listings, minimum_nights <= 4)) %>% #Using leaflet to display a map of the properties in our dataframe with a minimum stay of at most 4 nights
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   fillColor = "blue", 
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type)

2 Regression Analysis

For the target variable \(Y\), we will use the cost for two people to stay at an Airbnb location for four (4) nights.

We will create a new variable called price_4_nights that uses price and accommodates to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.
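
One possible way to construct this variable is sketched below on made-up data. It rests on two assumptions that are not stated in the text: that price is the nightly price for the whole listing regardless of the number of guests, and that listings which accommodate fewer than two people should be excluded.

```r
# Toy listings (values are made up)
toy <- data.frame(
  price        = c(100, 250, 80),   # assumed to be the nightly price
  accommodates = c(2, 4, 1)
)

# Keep only listings that can host at least two people (assumption)
two_guests <- subset(toy, accommodates >= 2)

# Total cost of a four-night stay under the nightly-price assumption
two_guests$price_4_nights <- two_guests$price * 4

print(two_guests)
```

If cleaning fees or extra-guest charges were available in the data, they would need to be added to this total.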

# In this part we do not drop neighbourhood_group_cleansed, because it will be used in the analysis part (as Kostis asked us to do), or license, since it may have some impact on the final model.
drop_columns <- (c("id", #useless in our analysis
                 "listing_url", #useless in our analysis
                 "scrape_id", #useless in our analysis
                 "last_scraped", #useless in our analysis
                 "name", #useless in our analysis
                 "description", #useless in our analysis
                 "neighborhood_overview", #useless in our analysis
                 "picture_url", #useless in our analysis
                 "host_id", #useless in our analysis
                 "host_url", #useless in our analysis
                 "host_name", #useless in our analysis
                 "host_about", #useless in our analysis
                 "host_thumbnail_url", #useless in our analysis
                 "host_picture_url", #useless in our analysis
                 "bathrooms", #contains only NAs
                 "minimum_minimum_nights", #inconsistent data, removed following this advice: https://medium.com/@kalenderselmir/munichs-airbnb-data-analysis-fd815f2c918f
                 "maximum_minimum_nights", #inconsistent data, removed following this advice: https://medium.com/@kalenderselmir/munichs-airbnb-data-analysis-fd815f2c918f
                 "minimum_maximum_nights", #inconsistent data, removed following this advice: https://medium.com/@kalenderselmir/munichs-airbnb-data-analysis-fd815f2c918f
                 "maximum_maximum_nights", #inconsistent data, removed following this advice: https://medium.com/@kalenderselmir/munichs-airbnb-data-analysis-fd815f2c918f
                 "minimum_nights_avg_ntm", #inconsistent data, removed following this advice: https://medium.com/@kalenderselmir/munichs-airbnb-data-analysis-fd815f2c918f
                 "maximum_nights_avg_ntm", #inconsistent data, removed following this advice: https://medium.com/@kalenderselmir/munichs-airbnb-data-analysis-fd815f2c918f
                 "calendar_updated", #contains only NAs
                 "calendar_last_scraped",
                 "first_review",
                 "last_review",
                 "calculated_host_listings_count",
                 "calculated_host_listings_count_entire_homes",
                 "calculated_host_listings_count_private_rooms",
                 "calculated_host_listings_count_shared_rooms"))

listings_sydney <- listings %>% # creating a new dataframe without the unneeded columns, keeping our old df intact
  select(-all_of(drop_columns)) # all_of() makes it explicit that drop_columns is a vector of column names


bathrooms_list<-unique(as.character(listings_sydney$bathrooms_text))

listings_sydney_2 <- listings_sydney %>% # we withdraw the numbers from the below strings
  mutate(bathrooms_number=case_when(bathrooms_text=="1 shared bath"~1,
                                    bathrooms_text=="3 baths"~3,
                                    bathrooms_text=="1 private bath"~1,
                                    bathrooms_text=="1 bath"~1,
                                    bathrooms_text=="1.5 shared baths"~1.5,
                                    bathrooms_text=="2.5 shared baths"~2.5,
                                    bathrooms_text=="2 baths"~2,
                                    bathrooms_text=="1.5 baths"~1.5,
                                    bathrooms_text=="2.5 baths"~2.5,
                                    bathrooms_text=="0 baths"~0,
                                    bathrooms_text=="2 shared baths"~2,
                                    bathrooms_text=="4 baths"~4,
                                    bathrooms_text=="3 shared baths"~3,
                                    bathrooms_text=="Half-bath"~0.5,
                                    bathrooms_text=="Shared half-bath"~0.5,
                                    bathrooms_text=="3.5 baths"~3.5,
                                    bathrooms_text=="3.5 shared baths"~3.5,
                                    bathrooms_text=="5 baths"~5,
                                    bathrooms_text=="4.5 baths"~4.5,
                                    bathrooms_text=="0 shared baths"~0,
                                    bathrooms_text=="6 baths"~6,
                                    bathrooms_text=="5.5 baths"~5.5,
                                    bathrooms_text=="6 shared baths"~6,
                                    bathrooms_text=="Private half-bath"~0.5,
                                    bathrooms_text=="8 baths"~8,
                                    bathrooms_text=="4 shared baths"~4,
                                    bathrooms_text=="7 baths"~7,
                                    bathrooms_text=="6.5 baths"~6.5,
                                    bathrooms_text=="5.5 shared baths"~5.5,
                                    bathrooms_text=="4.5 shared baths"~4.5,
                                    bathrooms_text=="5 shared baths"~5,
                                    bathrooms_text=="14.5 shared baths"~14.5,
                                    bathrooms_text=="7 shared baths"~7,
                                    bathrooms_text=="10 baths"~10))
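
An explicit mapping like the one above returns NA for any label that is spelled even slightly differently from the data. As an alternative, here is a minimal base-R sketch that extracts the leading number with a regular expression. `parse_bathrooms` and `example_labels` are hypothetical names introduced for illustration, not part of the original analysis, and the sketch assumes every label either starts with a number or contains "half-bath".

```r
# Hypothetical helper: extract the numeric bathroom count from labels such as
# "1.5 shared baths" or "Private half-bath". Not part of the original analysis.
parse_bathrooms <- function(x) {
  x <- as.character(x)
  out <- rep(NA_real_, length(x))
  # Labels that mention "half-bath" (and carry no digit) count as 0.5
  half <- !is.na(x) & grepl("half-bath", x, ignore.case = TRUE)
  out[half] <- 0.5
  # Otherwise take the leading number, e.g. "14.5" from "14.5 shared baths"
  has_num <- !is.na(x) & grepl("^[0-9]+(\\.[0-9]+)?", x)
  out[has_num] <- as.numeric(sub("^([0-9]+(\\.[0-9]+)?).*$", "\\1", x[has_num]))
  out
}

example_labels <- c("1 shared bath", "2.5 baths", "Half-bath",
                    "Private half-bath", "14.5 shared baths", NA)
parsed <- parse_bathrooms(example_labels)
parsed
```

Labels that match neither pattern stay NA, so they can be counted and inspected before modelling.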

#coordinates for Sydney Opera House: latitude -33.8568°, longitude 151.2153°
#formula for the straight-line distance between two coordinates: sqrt((x1-x2)^2+(y1-y2)^2); note that the result is in degrees, not kilometres
listings_sydney_opera_distance <- listings_sydney_2 %>% 
  mutate(distance_opera=sqrt((latitude-(-33.8568))^2+(longitude-151.2153)^2))
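
The distance above is measured in degrees and treats latitude and longitude as planar coordinates. If a distance in kilometres is wanted, the haversine formula is one common option. The sketch below is an illustration rather than the method used in this report; `haversine_km` is a hypothetical helper and the mean Earth radius of 6371 km is an assumption.

```r
# Great-circle (haversine) distance in kilometres between two points given in
# decimal degrees. The radius of 6371 km is the usual mean Earth radius.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

opera_lat <- -33.8568
opera_lon <- 151.2153

# One degree of latitude is roughly 111 km
one_deg_lat <- haversine_km(opera_lat, opera_lon, opera_lat + 1, opera_lon)

# One degree of longitude is shorter at Sydney's latitude, which is why
# treating degrees as planar coordinates distorts distances
one_deg_lon <- haversine_km(opera_lat, opera_lon, opera_lat, opera_lon + 1)

c(one_deg_lat = one_deg_lat, one_deg_lon = one_deg_lon)
```

The function is vectorised, so it could be called inside `mutate()` with the `latitude` and `longitude` columns as the second pair of arguments.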

listings_sydney_golden <- listings_sydney_opera_distance %>% #we now create our 'Golden' dataframe that we will use for the rest of our analysis
  filter(grepl("Sydney",host_location)) %>% #filter for location name to include "Sydney"
  filter(accommodates>=2, #to find price for 4 nights for 2 people first we restrict to those properties that can accommodate 2 or more people
         minimum_nights<=4) %>% #we can't consider properties that require you to stay for more than 4 nights and hence filter them out
  mutate(price_4_nights=price*4) %>% #we now take the price per night for the rooms that satisfy the above and multiply by 4 to get `price_4_nights`
  arrange(desc(price_4_nights)) %>% # note that we aren't calculating pro rata and multiplying by two as we are assuming that if 2 people book a property for 4 people, they still have to pay full price
  mutate(log_price_4_nights=log(price_4_nights))

We now use histograms and density plots to examine the distributions of price_4_nights and log_price_4_nights. Which variable should you use for the regression model? Why?

ggplot(listings_sydney_golden, aes(x = price_4_nights)) + # price_4_nights is strongly right-skewed
  geom_density()

ggplot(listings_sydney_golden, aes(x = price_4_nights)) + # ...which also shows in the histogram
  geom_histogram(bins = 50)

ggplot(listings_sydney_golden, aes(x = log_price_4_nights)) + # log_price_4_nights is still right-skewed, but much less so
  geom_density()

ggplot(listings_sydney_golden, aes(x = log_price_4_nights)) + # ...and its histogram is close to a normal distribution, so we use log_price_4_nights as the response
  geom_histogram(bins = 30)
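
The visual comparison can be backed up with a number. The sketch below computes the sample skewness before and after a log transform. It uses simulated log-normal prices as a stand-in for `price_4_nights`, so the values are illustrative only, and `skewness` is a small helper defined here rather than a function from a package.

```r
set.seed(123)
# Simulated right-skewed prices (log-normal), standing in for price_4_nights
price <- rlnorm(5000, meanlog = 6.3, sdlog = 0.7)

# Sample skewness: mean of the cubed standardised values
# (0 for a symmetric distribution, positive for a long right tail)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(price)       # strongly positive
skew_log <- skewness(log(price))  # close to zero
c(raw = skew_raw, log = skew_log)
```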

We will fit a regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.

model1 <-lm(log_price_4_nights ~  prop_type_simplified+number_of_reviews+review_scores_rating, data = listings_sydney_golden)
model1 %>% 
  glance()
r.squared adj.r.squared sigma statistic p.value df    logLik      AIC      BIC deviance df.residual nobs
    0.361          0.36 0.553       465       0  6 -4.09e+03 8.19e+03 8.24e+03 1.51e+03        4939 4946
msummary(model1) 
                                                       Estimate Std. Error
(Intercept)                                           6.2770556  0.0402906
prop_type_simplifiedEntire residential home           0.6901556  0.0262279
prop_type_simplifiedOther                            -0.1652345  0.0228788
prop_type_simplifiedPrivate room in rental unit      -0.6890471  0.0234864
prop_type_simplifiedPrivate room in residential home -0.7608048  0.0270010
number_of_reviews                                    -0.0003980  0.0001498
review_scores_rating                                  0.0263210  0.0085037
                                                     t value Pr(>|t|)    
(Intercept)                                          155.795  < 2e-16 ***
prop_type_simplifiedEntire residential home           26.314  < 2e-16 ***
prop_type_simplifiedOther                             -7.222  5.9e-13 ***
prop_type_simplifiedPrivate room in rental unit      -29.338  < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -28.177  < 2e-16 ***
number_of_reviews                                     -2.657  0.00792 ** 
review_scores_rating                                   3.095  0.00198 ** 

Residual standard error: 0.5534 on 4939 degrees of freedom
  (1165 observations deleted due to missingness)
Multiple R-squared:  0.3611,    Adjusted R-squared:  0.3603 
F-statistic: 465.3 on 6 and 4939 DF,  p-value: < 2.2e-16
pairs.panels(listings_sydney_golden[c("prop_type_simplified","number_of_reviews","review_scores_rating")])

autoplot(model1)+theme_bw()

  • Interpret the coefficient review_scores_rating in terms of log_price_4_nights.

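The coefficient is statistically significant (p = 0.002). Because the response is in logs, a one-point increase in review_scores_rating is associated with an increase of about 0.026 in log_price_4_nights, i.e. a price for 4 nights that is roughly 2.7% higher (exp(0.0263) - 1 ≈ 0.027), holding the other predictors constant.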
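
The conversion from a log-scale coefficient to a percentage change can be checked directly. The snippet below is a small sketch; `pct_change_from_log_coef` is a hypothetical helper, and the coefficients are copied from the model1 output above.

```r
# Convert a coefficient from a log-outcome regression into the implied
# percentage change in the outcome for a one-unit increase in the predictor
pct_change_from_log_coef <- function(beta) {
  100 * (exp(beta) - 1)
}

# Coefficients copied from the model1 output
effects <- c(
  review_scores_rating         = pct_change_from_log_coef(0.0263210),
  entire_residential_home      = pct_change_from_log_coef(0.6901556),
  private_room_in_rental_unit  = pct_change_from_log_coef(-0.6890471)
)
round(effects, 1)
```

For small coefficients the percentage is close to 100 times the coefficient, but for the larger property-type coefficients the two differ substantially, so the exponentiated form is the safer one to report.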

  • Interpret the coefficient of prop_type_simplified in terms of log_price_4_nights.

All property-type coefficients are significant. Each is a dummy variable that equals 1 if the listing belongs to that category and 0 otherwise, so its coefficient measures the difference in log_price_4_nights relative to the omitted reference category, holding the other predictors constant. An “Entire residential home” costs about 99% more than the reference category (exp(0.690) ≈ 1.99), whereas a “Private room in rental unit”, a “Private room in residential home” and “Other” property types cost about 50%, 53% and 15% less respectively, as the negative signs of their coefficients indicate.

We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. Fit a regression model called model2 that includes all of the explanatory variables in model1 plus room_type.

model2 <-lm(log_price_4_nights ~  prop_type_simplified+number_of_reviews+review_scores_rating+room_type, data = listings_sydney_golden)
msummary(model2)
                                                       Estimate Std. Error
(Intercept)                                           6.3022284  0.0392597
prop_type_simplifiedEntire residential home           0.6893621  0.0255180
prop_type_simplifiedOther                             0.0636344  0.0273906
prop_type_simplifiedPrivate room in rental unit      -0.0756192  0.0479169
prop_type_simplifiedPrivate room in residential home -0.1463870  0.0496692
number_of_reviews                                    -0.0005262  0.0001460
review_scores_rating                                  0.0216264  0.0082846
room_typeHotel room                                   0.0279596  0.0840393
room_typePrivate room                                -0.6162552  0.0422157
room_typeShared room                                 -1.0707088  0.1129886
                                                     t value Pr(>|t|)    
(Intercept)                                          160.527  < 2e-16 ***
prop_type_simplifiedEntire residential home           27.015  < 2e-16 ***
prop_type_simplifiedOther                              2.323 0.020208 *  
prop_type_simplifiedPrivate room in rental unit       -1.578 0.114600    
prop_type_simplifiedPrivate room in residential home  -2.947 0.003221 ** 
number_of_reviews                                     -3.605 0.000316 ***
review_scores_rating                                   2.610 0.009070 ** 
room_typeHotel room                                    0.333 0.739378    
room_typePrivate room                                -14.598  < 2e-16 ***
room_typeShared room                                  -9.476  < 2e-16 ***

Residual standard error: 0.5384 on 4936 degrees of freedom
  (1165 observations deleted due to missingness)
Multiple R-squared:  0.3956,    Adjusted R-squared:  0.3945 
F-statistic:   359 on 9 and 4936 DF,  p-value: < 2.2e-16
pairs.panels(listings_sydney_golden[c("prop_type_simplified","number_of_reviews","review_scores_rating","room_type")])

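Adjusted R-squared rises from 0.360 to 0.395, and the “Private room” and “Shared room” levels of room_type are highly significant, so room_type does add explanatory power. However, “prop_type_simplifiedPrivate room in rental unit” and “room_typeHotel room” are not significant at the 5% level, most likely because prop_type_simplified and room_type carry overlapping information. We therefore treat this model with caution despite its better overall fit.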
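
Whether a whole categorical variable improves a model is better judged with a partial F-test than by the p-values of its individual levels. The sketch below illustrates the idea on simulated data; the data frame `sim` and the models `small` and `large` are invented for this example. In this report the analogous call would be `anova(model1, model2)`, which is valid because both models are fitted to the same 4946 observations.

```r
set.seed(1)
n <- 200
sim <- data.frame(
  x1    = rnorm(n),
  group = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
# Simulated outcome in which 'group' genuinely shifts the mean
group_effect <- unname(c(a = 0, b = 0.8, c = -0.8)[as.character(sim$group)])
sim$y <- 1 + 0.5 * sim$x1 + group_effect + rnorm(n, sd = 0.5)

small <- lm(y ~ x1, data = sim)
large <- lm(y ~ x1 + group, data = sim)

# Partial F-test: does adding 'group' (2 extra parameters) reduce the
# residual sum of squares by more than chance alone would explain?
comparison <- anova(small, large)
comparison
```

A small p-value in the second row indicates that the added variable as a whole is worth keeping, even if one of its levels is not individually significant.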

2.1 Further variables/questions to explore on our own

Our dataset has many more variables.

  1. Are the number of bathrooms, bedrooms, beds, or size of the house (accommodates) significant predictors of price_4_nights? Or might these be collinear variables?
#we will analyse further variables below, so the previous variables should also be included

model_bathrooms <-lm(log_price_4_nights ~  bathrooms_number, data = listings_sydney_golden)
msummary(model_bathrooms)
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       5.56998    0.02076  268.28   <2e-16 ***
bathrooms_number  0.54493    0.01492   36.53   <2e-16 ***

Residual standard error: 0.6503 on 6102 degrees of freedom
  (7 observations deleted due to missingness)
Multiple R-squared:  0.1794,    Adjusted R-squared:  0.1793 
F-statistic:  1334 on 1 and 6102 DF,  p-value: < 2.2e-16

It seems as though it is!

#bedrooms
listings_sydney_bedrooms <- listings_sydney_golden
 #NA values in bedrooms are left untouched, since we do not want to alter the original data; lm() drops those rows

model_bedrooms <-lm(log_price_4_nights ~  bedrooms, data = listings_sydney_bedrooms)
msummary(model_bedrooms)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.477427   0.015678  349.36   <2e-16 ***
bedrooms    0.512737   0.008669   59.14   <2e-16 ***

Residual standard error: 0.5798 on 5617 degrees of freedom
  (492 observations deleted due to missingness)
Multiple R-squared:  0.3838,    Adjusted R-squared:  0.3836 
F-statistic:  3498 on 1 and 5617 DF,  p-value: < 2.2e-16

bedrooms works too!

#beds
listings_sydney_beds <- listings_sydney_golden
 #NA values in beds are left untouched, since we do not want to alter the original data; lm() drops those rows

model_beds <-lm(log_price_4_nights ~  beds, data = listings_sydney_beds)
msummary(model_beds)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.75530    0.01323  434.91   <2e-16 ***
beds         0.27846    0.00579   48.09   <2e-16 ***

Residual standard error: 0.6122 on 6055 degrees of freedom
  (54 observations deleted due to missingness)
Multiple R-squared:  0.2764,    Adjusted R-squared:  0.2763 
F-statistic:  2313 on 1 and 6055 DF,  p-value: < 2.2e-16

beds works as well!

model_accommodates <-lm(log_price_4_nights ~  accommodates, data = listings_sydney_golden)
msummary(model_accommodates)
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.523914   0.014768  374.06   <2e-16 ***
accommodates 0.224385   0.003868   58.01   <2e-16 ***

Residual standard error: 0.5779 on 6109 degrees of freedom
Multiple R-squared:  0.3552,    Adjusted R-squared:  0.3551 
F-statistic:  3365 on 1 and 6109 DF,  p-value: < 2.2e-16

As expected, accommodates is also significant.

Now what happens when we put the above altogether?

#test collinearity between bathrooms_number, bedrooms, beds and accommodates
model_4_variables <-lm(log_price_4_nights~ bathrooms_number+bedrooms+beds+accommodates, data=listings_sydney_golden)
msummary(model_4_variables)
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       5.385095   0.019640 274.194  < 2e-16 ***
bathrooms_number  0.065404   0.017105   3.824 0.000133 ***
bedrooms          0.313939   0.018199  17.250  < 2e-16 ***
beds             -0.028267   0.011244  -2.514 0.011969 *  
accommodates      0.110133   0.009221  11.944  < 2e-16 ***

Residual standard error: 0.5692 on 5562 degrees of freedom
  (544 observations deleted due to missingness)
Multiple R-squared:  0.404, Adjusted R-squared:  0.4035 
F-statistic: 942.4 on 4 and 5562 DF,  p-value: < 2.2e-16
car::vif(model_4_variables) #common rules of thumb flag a VIF above 5 (or, more leniently, above 10); all four are below 10 and only accommodates is slightly above 5
bathrooms_number         bedrooms             beds     accommodates 
        1.664195         4.510377         4.112733         5.564048 
autoplot(model_4_variables)+theme_bw()

Together the four variables give a better model than any one of them alone (adjusted R-squared 0.40, against at most 0.38 for a single variable), but the VIF for accommodates is above 5.
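
For a numeric predictor, the VIF is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below reproduces this calculation by hand on simulated data. The variables are invented to mimic the overlap between bedrooms, beds and capacity; `manual_vif` is a hypothetical helper and is not how `car::vif()` is implemented internally.

```r
set.seed(42)
n <- 500
# Simulated listing sizes: beds and capacity both depend on bedrooms
bedrooms     <- rpois(n, 2) + 1
beds         <- bedrooms + rpois(n, 1)
accommodates <- 2 * bedrooms + rpois(n, 1)
bathrooms    <- 1 + rbinom(n, 2, 0.3)   # unrelated to the size variables
sim <- data.frame(bedrooms, beds, accommodates, bathrooms)

# VIF of a predictor = 1 / (1 - R^2) from regressing it on the other predictors
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- manual_vif(sim, c("bedrooms", "beds", "accommodates", "bathrooms"))
round(vifs, 2)
```

In the simulation the unrelated `bathrooms` variable has a VIF near 1, while the variables that largely measure the same thing have much larger values.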

model_4_variables_final <-lm(log_price_4_nights~ bathrooms_number+bedrooms+beds+accommodates+prop_type_simplified+number_of_reviews+review_scores_rating+room_type, data=listings_sydney_golden)
msummary(model_4_variables_final)
                                                       Estimate Std. Error
(Intercept)                                           5.679e+00  4.214e-02
bathrooms_number                                      1.459e-01  1.685e-02
bedrooms                                              2.165e-01  1.883e-02
beds                                                 -7.348e-03  1.072e-02
accommodates                                          4.286e-02  8.967e-03
prop_type_simplifiedEntire residential home           1.577e-01  2.784e-02
prop_type_simplifiedOther                            -1.486e-02  2.649e-02
prop_type_simplifiedPrivate room in rental unit      -1.159e-01  4.472e-02
prop_type_simplifiedPrivate room in residential home -2.570e-01  4.626e-02
number_of_reviews                                    -6.772e-05  1.407e-04
review_scores_rating                                  2.292e-02  7.712e-03
room_typeHotel room                                   2.554e-01  9.306e-02
room_typePrivate room                                -4.272e-01  4.044e-02
room_typeShared room                                 -8.936e-01  1.040e-01
                                                     t value Pr(>|t|)    
(Intercept)                                          134.769  < 2e-16 ***
bathrooms_number                                       8.658  < 2e-16 ***
bedrooms                                              11.496  < 2e-16 ***
beds                                                  -0.686  0.49297    
accommodates                                           4.780 1.81e-06 ***
prop_type_simplifiedEntire residential home            5.665 1.56e-08 ***
prop_type_simplifiedOther                             -0.561  0.57485    
prop_type_simplifiedPrivate room in rental unit       -2.593  0.00955 ** 
prop_type_simplifiedPrivate room in residential home  -5.555 2.94e-08 ***
number_of_reviews                                     -0.481  0.63043    
review_scores_rating                                   2.972  0.00298 ** 
room_typeHotel room                                    2.744  0.00609 ** 
room_typePrivate room                                -10.563  < 2e-16 ***
room_typeShared room                                  -8.592  < 2e-16 ***

Residual standard error: 0.4826 on 4491 degrees of freedom
  (1606 observations deleted due to missingness)
Multiple R-squared:  0.5421,    Adjusted R-squared:  0.5408 
F-statistic: 409.1 on 13 and 4491 DF,  p-value: < 2.2e-16
#using a VIF threshold of 5: bedrooms, accommodates, prop_type_simplified and room_type all have a GVIF above 5, so we remove prop_type_simplified (the largest) and hope this reduces the GVIF of room_type and bedrooms in the next model
car::vif(model_4_variables_final)
                          GVIF Df GVIF^(1/(2*Df))
bathrooms_number      1.746976  1        1.321732
bedrooms              5.390611  1        2.321769
beds                  4.266013  1        2.065433
accommodates          6.073956  1        2.464540
prop_type_simplified 10.304991  4        1.338539
number_of_reviews     1.054508  1        1.026893
review_scores_rating  1.025472  1        1.012656
room_type             7.733543  3        1.406252
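#a likely reason: prop_type_simplified and room_type overlap heavily (a "Private room in ..." property type is by definition a private room), which inflates both GVIFs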
autoplot(model_4_variables_final)+theme_bw()
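
For terms with more than one degree of freedom, `car::vif()` also reports GVIF^(1/(2*Df)), which puts terms with different Df on a common scale. Squaring that value gives a number that can be compared with an ordinary VIF threshold. The arithmetic below is a sketch using the values copied from the output above; comparing the squared value with a threshold of 5 is a rule of thumb, not a formal test.

```r
# GVIF values and degrees of freedom copied from the car::vif() output above
gvif <- c(prop_type_simplified = 10.304991, room_type = 7.733543,
          bedrooms = 5.390611, accommodates = 6.073956)
df   <- c(prop_type_simplified = 4, room_type = 3,
          bedrooms = 1, accommodates = 1)

# GVIF^(1/(2*Df)) adjusts for the number of coefficients in each term;
# its square is comparable to an ordinary single-coefficient VIF
adjusted <- gvif^(1 / (2 * df))
round(cbind(gvif, df, adjusted, adjusted_sq = adjusted^2), 3)
```

On this adjusted scale the two categorical terms look less severe than their raw GVIFs suggest, whereas bedrooms and accommodates are unchanged because they have a single degree of freedom.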

model_4_variables_final2 <-lm(log_price_4_nights~ bathrooms_number+bedrooms+beds+accommodates+number_of_reviews+review_scores_rating+room_type, data=listings_sydney_golden)
msummary(model_4_variables_final2)
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.646e+00  4.200e-02 134.430  < 2e-16 ***
bathrooms_number       1.397e-01  1.690e-02   8.269  < 2e-16 ***
bedrooms               2.505e-01  1.810e-02  13.837  < 2e-16 ***
beds                  -5.033e-03  1.080e-02  -0.466  0.64129    
accommodates           4.140e-02  9.039e-03   4.580 4.77e-06 ***
number_of_reviews     -5.254e-05  1.409e-04  -0.373  0.70922    
review_scores_rating   2.416e-02  7.768e-03   3.110  0.00188 ** 
room_typeHotel room    2.348e-01  9.119e-02   2.574  0.01007 *  
room_typePrivate room -5.770e-01  1.782e-02 -32.379  < 2e-16 ***
room_typeShared room  -9.107e-01  1.022e-01  -8.906  < 2e-16 ***

Residual standard error: 0.4869 on 4495 degrees of freedom
  (1606 observations deleted due to missingness)
Multiple R-squared:  0.5336,    Adjusted R-squared:  0.5327 
F-statistic: 571.5 on 9 and 4495 DF,  p-value: < 2.2e-16
car::vif(model_4_variables_final2) 
                         GVIF Df GVIF^(1/(2*Df))
bathrooms_number     1.725377  1        1.313536
bedrooms             4.894539  1        2.212360
beds                 4.259119  1        2.063763
accommodates         6.065239  1        2.462771
number_of_reviews    1.038256  1        1.018949
review_scores_rating 1.022591  1        1.011233
room_type            1.379121  3        1.055036
autoplot(model_4_variables_final2)+theme_bw()

#note that the VIF for accommodates is still greater than 5
#coordinates for Sydney Opera House: latitude -33.8568°, longitude 151.2153°
#formula for the straight-line distance between two coordinates: sqrt((x1-x2)^2+(y1-y2)^2)

model_opera <-lm(log_price_4_nights ~  distance_opera, data = listings_sydney_golden)
summary(model_opera)

Call:
lm(formula = log_price_4_nights ~ distance_opera, data = listings_sydney_golden)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4848 -0.4984 -0.0873  0.3854  4.6849 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     6.27267    0.01375 456.263   <2e-16 ***
distance_opera -0.08931    0.12737  -0.701    0.483    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7197 on 6109 degrees of freedom
Multiple R-squared:  8.048e-05, Adjusted R-squared:  -8.32e-05 
F-statistic: 0.4917 on 1 and 6109 DF,  p-value: 0.4832
  1. Do superhosts (host_is_superhost) command a pricing premium, after controlling for other variables?
#host_is_superhost is a categorical (TRUE/FALSE) variable; we test whether superhosts command a pricing premium

model_superhost<-lm(log_price_4_nights~ bathrooms_number+bedrooms+beds+accommodates+host_is_superhost+number_of_reviews+review_scores_rating+room_type,data=listings_sydney_golden)
msummary(model_superhost)
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.6497921  0.0420983 134.205  < 2e-16 ***
bathrooms_number       0.1383565  0.0169231   8.176 3.79e-16 ***
bedrooms               0.2510826  0.0181064  13.867  < 2e-16 ***
beds                  -0.0052422  0.0108025  -0.485  0.62750    
accommodates           0.0413960  0.0090381   4.580 4.77e-06 ***
host_is_superhostTRUE  0.0288629  0.0206824   1.396  0.16293    
number_of_reviews     -0.0001195  0.0001488  -0.803  0.42194    
review_scores_rating   0.0228035  0.0078283   2.913  0.00360 ** 
room_typeHotel room    0.2382824  0.0912135   2.612  0.00902 ** 
room_typePrivate room -0.5767763  0.0178185 -32.370  < 2e-16 ***
room_typeShared room  -0.9070708  0.1022715  -8.869  < 2e-16 ***

Residual standard error: 0.4868 on 4494 degrees of freedom
  (1606 observations deleted due to missingness)
Multiple R-squared:  0.5338,    Adjusted R-squared:  0.5328 
F-statistic: 514.6 on 10 and 4494 DF,  p-value: < 2.2e-16
car::vif(model_superhost)
                         GVIF Df GVIF^(1/(2*Df))
bathrooms_number     1.731141  1        1.315728
bedrooms             4.897131  1        2.212946
beds                 4.259938  1        2.063962
accommodates         6.065240  1        2.462771
host_is_superhost    1.151071  1        1.072880
number_of_reviews    1.158792  1        1.076472
review_scores_rating 1.038628  1        1.019131
room_type            1.381051  3        1.055281
# the VIF for accommodates is 6.07; host_is_superhost is not significant in this model (p = 0.163)
  1. Some hosts allow you to immediately book their listing (instant_bookable == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is instant_bookable a significant predictor of price_4_nights?
model_instant_bookable<- lm(log_price_4_nights~bathrooms_number+bedrooms+beds+accommodates+host_is_superhost+instant_bookable+number_of_reviews+review_scores_rating+room_type, data=listings_sydney_golden)

msummary(model_instant_bookable)
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.6847640  0.0428951 132.527  < 2e-16 ***
bathrooms_number       0.1384559  0.0168939   8.196 3.22e-16 ***
bedrooms               0.2451502  0.0181338  13.519  < 2e-16 ***
beds                  -0.0064136  0.0107877  -0.595  0.55219    
accommodates           0.0441355  0.0090475   4.878 1.11e-06 ***
host_is_superhostTRUE  0.0302634  0.0206496   1.466  0.14283    
instant_bookableTRUE  -0.0615180  0.0151154  -4.070 4.78e-05 ***
number_of_reviews     -0.0001185  0.0001486  -0.797  0.42525    
review_scores_rating   0.0205140  0.0078350   2.618  0.00887 ** 
room_typeHotel room    0.2614332  0.0912335   2.866  0.00418 ** 
room_typePrivate room -0.5764499  0.0177879 -32.407  < 2e-16 ***
room_typeShared room  -0.9179767  0.1021300  -8.988  < 2e-16 ***

Residual standard error: 0.486 on 4493 degrees of freedom
  (1606 observations deleted due to missingness)
Multiple R-squared:  0.5355,    Adjusted R-squared:  0.5344 
F-statistic:   471 on 11 and 4493 DF,  p-value: < 2.2e-16
car::vif(model_instant_bookable)
                         GVIF Df GVIF^(1/(2*Df))
bathrooms_number     1.731145  1        1.315730
bedrooms             4.928978  1        2.220130
beds                 4.262972  1        2.064697
accommodates         6.098998  1        2.469615
host_is_superhost    1.151390  1        1.073029
instant_bookable     1.020612  1        1.010254
number_of_reviews    1.158796  1        1.076474
review_scores_rating 1.044010  1        1.021768
room_type            1.387471  3        1.056097
autoplot(model_instant_bookable)+theme_bw()

  1. For all cities, there are 3 variables that relate to neighbourhoods: neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed. There are typically more than 20 neighbourhoods in each city, and it wouldn’t make sense to include them all in your model. Use your city knowledge, or ask someone with city knowledge, and see whether you can group neighbourhoods together so the majority of listings falls in fewer (5-6 max) geographical areas. You would thus need to create a new categorical variable neighbourhood_simplified and determine whether location is a predictor of price_4_nights.
listings_sydney_golden %>%
  group_by(neighbourhood_cleansed)%>%
  summarise(count=n()) %>%
  arrange(desc(count)) 
neighbourhood_cleansed  count
Sydney                   1872
Waverley                  739
Randwick                  476
Marrickville              293
North Sydney              265
Woollahra                 247
Warringah                 210
Leichhardt                200
Manly                     179
Pittwater                 175
Rockdale                  130
Ryde                      109
Botany Bay                107
Auburn                    103
Sutherland Shire           77
Willoughby                 71
Canada Bay                 68
Mosman                     68
Parramatta                 66
Hornsby                    65
Canterbury                 62
Ku-Ring-Gai                56
Burwood                    51
Lane Cove                  51
Ashfield                   50
Blacktown                  49
Bankstown                  39
The Hills Shire            39
Hurstville                 38
City Of Kogarah            29
Strathfield                29
Penrith                    21
Fairfield                  20
Campbelltown               15
Hunters Hill               15
Liverpool                  12
Holroyd                    10
Camden                      5
# since Greater Sydney is our target city, we group neighbourhoods by geographic location into 5 areas: central, east, north, west and south Sydney
neighbourhood_location<-c("central sydney","east sydney","north sydney","west sydney","south sydney")
                          
listings_sydney_golden<-listings_sydney_golden %>%
  mutate(neighbourhood_simplified=case_when(
    neighbourhood_cleansed %in% c("Sydney") ~ "Central",
    neighbourhood_cleansed %in% c("Botany Bay","Camden","Waverley","Randwick","Woollahra") ~ "East",
    neighbourhood_cleansed %in% c("North Sydney","Warringah","Manly","Pittwater","Mosman","Hornsby","Ku-Ring-Gai","Lane Cove","Hunters Hill","Willoughby") ~ "North",
    neighbourhood_cleansed %in% c("Rockdale","Sutherland Shire","Hurstville","City Of Kogarah") ~ "South", # the data spell this "City Of Kogarah", with a capital O
    neighbourhood_cleansed %in% c("Marrickville","Leichhardt","Ryde","Auburn","Canada Bay","Parramatta","Canterbury","Burwood","Ashfield","Blacktown","Bankstown","The Hills Shire","Strathfield","Penrith","Fairfield","Campbelltown","Liverpool")~ "West", 
    TRUE ~ "Other"))
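
A `case_when()` with a fallback sends every misspelt or omitted name to "Other" without any warning, so it is worth listing what ends up there. The base-R sketch below shows one way to do this with a lookup table. `area_lookup`, `observed` and `assign_area` are hypothetical names, and the small lookup is deliberately incomplete to show the check at work.

```r
# Hypothetical, deliberately incomplete lookup: area -> council names
area_lookup <- list(
  Central = c("Sydney"),
  South   = c("Rockdale", "Hurstville", "City of Kogarah")  # lower-case "of"
)

# Names as they might appear in the data
observed <- c("Sydney", "Rockdale", "City Of Kogarah", "Holroyd", "Sydney")

# Map every observed name to its area; anything unmatched becomes "Other"
assign_area <- function(x, lookup) {
  flat <- setNames(rep(names(lookup), lengths(lookup)),
                   unlist(lookup, use.names = FALSE))
  area <- unname(flat[x])
  area[is.na(area)] <- "Other"
  area
}

areas <- assign_area(observed, area_lookup)
table(areas)

# List the names that fell through, so typos and omissions become visible
fell_through <- unique(observed[areas == "Other"])
fell_through
```

Running the same check on `neighbourhood_cleansed` after building `neighbourhood_simplified` would show exactly which councils make up the "Other" group.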

model_neighbourhood<-lm(log_price_4_nights~neighbourhood_simplified,data=listings_sydney_golden)
msummary(model_neighbourhood)
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    6.27224    0.01610 389.495  < 2e-16 ***
neighbourhood_simplifiedEast   0.02348    0.02383   0.985  0.32455    
neighbourhood_simplifiedNorth  0.27925    0.02607  10.712  < 2e-16 ***
neighbourhood_simplifiedOther -0.24591    0.11272  -2.182  0.02918 *  
neighbourhood_simplifiedSouth -0.15416    0.04734  -3.257  0.00113 ** 
neighbourhood_simplifiedWest  -0.28815    0.02560 -11.256  < 2e-16 ***

Residual standard error: 0.6967 on 6105 degrees of freedom
Multiple R-squared:  0.0634,    Adjusted R-squared:  0.06264 
F-statistic: 82.66 on 5 and 6105 DF,  p-value: < 2.2e-16
# location is significant overall, but East is not significantly different from Central, the reference category (p = 0.32); possible explanations, such as the economic status of the area, need further investigation

# next we test whether neighbourhood_simplified is a significant predictor of price_4_nights after controlling for the other variables
model_neighbourhood_cleansed<-lm(log_price_4_nights~neighbourhood_simplified+bathrooms_number+bedrooms+beds+accommodates+host_is_superhost+instant_bookable+number_of_reviews+review_scores_rating+room_type,data=listings_sydney_golden)
msummary(model_neighbourhood_cleansed)
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    5.7375381  0.0428014 134.050  < 2e-16 ***
neighbourhood_simplifiedEast  -0.0005874  0.0191315  -0.031  0.97551    
neighbourhood_simplifiedNorth  0.0571296  0.0207036   2.759  0.00581 ** 
neighbourhood_simplifiedOther -0.3770220  0.0935316  -4.031 5.65e-05 ***
neighbourhood_simplifiedSouth -0.2533750  0.0375317  -6.751 1.66e-11 ***
neighbourhood_simplifiedWest  -0.2835638  0.0203669 -13.923  < 2e-16 ***
bathrooms_number               0.1412038  0.0163555   8.633  < 2e-16 ***
bedrooms                       0.2438566  0.0177044  13.774  < 2e-16 ***
beds                          -0.0118459  0.0104310  -1.136  0.25616    
accommodates                   0.0503154  0.0087909   5.724 1.11e-08 ***
host_is_superhostTRUE          0.0490761  0.0201049   2.441  0.01469 *  
instant_bookableTRUE          -0.0603499  0.0146379  -4.123 3.81e-05 ***
number_of_reviews             -0.0001319  0.0001442  -0.915  0.36022    
review_scores_rating           0.0154186  0.0075722   2.036  0.04179 *  
room_typeHotel room            0.2191907  0.0881573   2.486  0.01294 *  
room_typePrivate room         -0.5381683  0.0173369 -31.042  < 2e-16 ***
room_typeShared room          -0.8397026  0.0986974  -8.508  < 2e-16 ***

Residual standard error: 0.4691 on 4488 degrees of freedom
  (1606 observations deleted due to missingness)
Multiple R-squared:  0.5677,    Adjusted R-squared:  0.5661 
F-statistic: 368.3 on 16 and 4488 DF,  p-value: < 2.2e-16
car::vif(model_neighbourhood_cleansed)
                             GVIF Df GVIF^(1/(2*Df))
neighbourhood_simplified 1.128783  5        1.012188
bathrooms_number         1.741162  1        1.319531
bedrooms                 5.041776  1        2.245390
beds                     4.277070  1        2.068108
accommodates             6.178858  1        2.485731
host_is_superhost        1.171245  1        1.082241
instant_bookable         1.027119  1        1.013469
number_of_reviews        1.170844  1        1.082055
review_scores_rating     1.046425  1        1.022949
room_type                1.420312  3        1.060223
# the F-statistic's p-value is below 0.001, so the model as a whole is significant, and adjusted R-squared rises to 0.566 (from 0.534 without neighbourhood_simplified)
autoplot(model_neighbourhood_cleansed)+theme_bw()

model_neighbourhood_cleansed2<-lm(log_price_4_nights~neighbourhood_simplified+bathrooms_number+bedrooms+accommodates+host_is_superhost+instant_bookable+review_scores_rating+room_type,data=listings_sydney_golden)
msummary(model_neighbourhood_cleansed2)
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    5.7381019  0.0422993 135.655  < 2e-16 ***
neighbourhood_simplifiedEast   0.0003965  0.0190050   0.021  0.98336    
neighbourhood_simplifiedNorth  0.0567617  0.0205813   2.758  0.00584 ** 
neighbourhood_simplifiedOther -0.3720965  0.0933802  -3.985 6.86e-05 ***
neighbourhood_simplifiedSouth -0.2568450  0.0373825  -6.871 7.26e-12 ***
neighbourhood_simplifiedWest  -0.2826136  0.0202875 -13.930  < 2e-16 ***
bathrooms_number               0.1412536  0.0163204   8.655  < 2e-16 ***
bedrooms                       0.2406349  0.0171944  13.995  < 2e-16 ***
accommodates                   0.0444407  0.0073031   6.085 1.26e-09 ***
host_is_superhostTRUE          0.0421184  0.0189995   2.217  0.02669 *  
instant_bookableTRUE          -0.0596457  0.0145907  -4.088 4.43e-05 ***
review_scores_rating           0.0152765  0.0075186   2.032  0.04223 *  
room_typeHotel room            0.2132266  0.0879698   2.424  0.01540 *  
room_typePrivate room         -0.5388664  0.0172415 -31.254  < 2e-16 ***
room_typeShared room          -0.8588854  0.0968081  -8.872  < 2e-16 ***

Residual standard error: 0.4688 on 4505 degrees of freedom
  (1591 observations deleted due to missingness)
Multiple R-squared:  0.5679,    Adjusted R-squared:  0.5665 
F-statistic: 422.8 on 14 and 4505 DF,  p-value: < 2.2e-16
car::vif(model_neighbourhood_cleansed2)
                             GVIF Df GVIF^(1/(2*Df))
neighbourhood_simplified 1.112782  5        1.010744
bathrooms_number         1.737565  1        1.318167
bedrooms                 4.768603  1        2.183713
accommodates             4.276853  1        2.068055
host_is_superhost        1.050257  1        1.024821
instant_bookable         1.026558  1        1.013192
review_scores_rating     1.043291  1        1.021416
room_type                1.359120  3        1.052470
#The anova compares the models with and without the neighbourhood variable (neighbourhood_simplified) to check whether it improves the fit. The F statistic is large and the p-value is smaller than 0.001, so this variable should be kept
anova(model_instant_bookable,model_neighbourhood_cleansed)
    Res.Df      RSS Df Sum of Sq    F  Pr(>F)
  4.49e+03 1.06e+03
  4.49e+03      988  5      73.4 66.7 2.1e-67
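The comparison above is a partial F-test between two nested models. A minimal sketch of the same pattern follows, using R's built-in mtcars data as a stand-in (an assumption for illustration only; none of these variables come from the Airbnb data): fit the model without the candidate factor, fit it again with the factor, then pass both to anova().

```r
# Illustration on built-in data (mtcars), not on the Airbnb listings.
cars <- transform(mtcars, cyl = factor(cyl))

# Reduced model: without the candidate variable.
model_reduced <- lm(log(mpg) ~ wt + hp, data = cars)

# Full model: same predictors plus the candidate factor (3 levels, 2 dummies).
model_full <- lm(log(mpg) ~ wt + hp + cyl, data = cars)

# Partial F-test: does adding cyl reduce the residual sum of squares
# by more than chance alone would explain?
f_test <- anova(model_reduced, model_full)
print(f_test)

# Degrees of freedom used by the added term and its p-value.
added_df <- f_test$Df[2]
p_value  <- f_test$`Pr(>F)`[2]
```

anova() needs both models to be fitted on the same observations, so with missing values the data may have to be filtered to complete rows before fitting.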
#To measure the effect of a variable while controlling for the others, we should first run a linear regression without this variable but containing the other variables (like x1, x2, ...), and then run a new linear regression containing all variables
model_immediate_booking<-lm(log_price_4_nights~has_availability,data=listings_sydney_golden)
msummary(model_immediate_booking)
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           6.23173    0.09291  67.071   <2e-16 ***
has_availabilityTRUE  0.03412    0.09337   0.365    0.715    

Residual standard error: 0.7197 on 6109 degrees of freedom
Multiple R-squared:  2.186e-05, Adjusted R-squared:  -0.0001418 
F-statistic: 0.1335 on 1 and 6109 DF,  p-value: 0.7148
# has_availability is not significant (p = 0.715, far above alpha = 0.001), so we drop this variable from the new models

#After adding the other variables, has_availability still shows no pricing premium: the adjusted R-squared remains 0.447 and the p-value for has_availability is greater than 0.01. We therefore drop this variable, since it is not useful for prediction
model_immediate_booking_final<-lm(log_price_4_nights~neighbourhood_simplified+bathrooms_number+bedrooms+beds+accommodates+host_is_superhost+instant_bookable+number_of_reviews+review_scores_rating+room_type,data=listings_sydney_golden)
msummary(model_immediate_booking_final)
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    5.7375381  0.0428014 134.050  < 2e-16 ***
neighbourhood_simplifiedEast  -0.0005874  0.0191315  -0.031  0.97551    
neighbourhood_simplifiedNorth  0.0571296  0.0207036   2.759  0.00581 ** 
neighbourhood_simplifiedOther -0.3770220  0.0935316  -4.031 5.65e-05 ***
neighbourhood_simplifiedSouth -0.2533750  0.0375317  -6.751 1.66e-11 ***
neighbourhood_simplifiedWest  -0.2835638  0.0203669 -13.923  < 2e-16 ***
bathrooms_number               0.1412038  0.0163555   8.633  < 2e-16 ***
bedrooms                       0.2438566  0.0177044  13.774  < 2e-16 ***
beds                          -0.0118459  0.0104310  -1.136  0.25616    
accommodates                   0.0503154  0.0087909   5.724 1.11e-08 ***
host_is_superhostTRUE          0.0490761  0.0201049   2.441  0.01469 *  
instant_bookableTRUE          -0.0603499  0.0146379  -4.123 3.81e-05 ***
number_of_reviews             -0.0001319  0.0001442  -0.915  0.36022    
review_scores_rating           0.0154186  0.0075722   2.036  0.04179 *  
room_typeHotel room            0.2191907  0.0881573   2.486  0.01294 *  
room_typePrivate room         -0.5381683  0.0173369 -31.042  < 2e-16 ***
room_typeShared room          -0.8397026  0.0986974  -8.508  < 2e-16 ***

Residual standard error: 0.4691 on 4488 degrees of freedom
  (1606 observations deleted due to missingness)
Multiple R-squared:  0.5677,    Adjusted R-squared:  0.5661 
F-statistic: 368.3 on 16 and 4488 DF,  p-value: < 2.2e-16
car::vif(model_immediate_booking_final)
                             GVIF Df GVIF^(1/(2*Df))
neighbourhood_simplified 1.128783  5        1.012188
bathrooms_number         1.741162  1        1.319531
bedrooms                 5.041776  1        2.245390
beds                     4.277070  1        2.068108
accommodates             6.178858  1        2.485731
host_is_superhost        1.171245  1        1.082241
instant_bookable         1.027119  1        1.013469
number_of_reviews        1.170844  1        1.082055
review_scores_rating     1.046425  1        1.022949
room_type                1.420312  3        1.060223
autoplot(model_immediate_booking_final)+theme_bw()

  1. What is the effect of availability_30 or reviews_per_month on price_4_nights, after we control for other variables?
model_availability<-lm(log_price_4_nights~availability_30+neighbourhood_simplified+bathrooms_number+bedrooms+beds+accommodates+host_is_superhost+instant_bookable+number_of_reviews+review_scores_rating+room_type,data=listings_sydney_golden)
msummary(model_availability)
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    5.7110007  0.0426509 133.901  < 2e-16 ***
availability_30                0.0045961  0.0005866   7.836 5.79e-15 ***
neighbourhood_simplifiedEast  -0.0012104  0.0190042  -0.064   0.9492    
neighbourhood_simplifiedNorth  0.0499703  0.0205860   2.427   0.0152 *  
neighbourhood_simplifiedOther -0.3849392  0.0929141  -4.143 3.49e-05 ***
neighbourhood_simplifiedSouth -0.2589444  0.0372884  -6.944 4.35e-12 ***
neighbourhood_simplifiedWest  -0.2925903  0.0202640 -14.439  < 2e-16 ***
bathrooms_number               0.1399203  0.0162473   8.612  < 2e-16 ***
bedrooms                       0.2464159  0.0175895  14.009  < 2e-16 ***
beds                          -0.0136642  0.0103641  -1.318   0.1874    
accommodates                   0.0487059  0.0087348   5.576 2.60e-08 ***
host_is_superhostTRUE          0.0330220  0.0200758   1.645   0.1001    
instant_bookableTRUE          -0.0576433  0.0145445  -3.963 7.51e-05 ***
number_of_reviews             -0.0002890  0.0001446  -1.999   0.0457 *  
review_scores_rating           0.0173582  0.0075258   2.306   0.0211 *  
room_typeHotel room            0.1735356  0.0877636   1.977   0.0481 *  
room_typePrivate room         -0.5473269  0.0172610 -31.709  < 2e-16 ***
room_typeShared room          -0.8501996  0.0980491  -8.671  < 2e-16 ***

Residual standard error: 0.466 on 4487 degrees of freedom
  (1606 observations deleted due to missingness)
Multiple R-squared:  0.5735,    Adjusted R-squared:  0.5719 
F-statistic: 354.9 on 17 and 4487 DF,  p-value: < 2.2e-16
car::vif(model_availability)
                             GVIF Df GVIF^(1/(2*Df))
availability_30          1.062963  1        1.031001
neighbourhood_simplified 1.134297  5        1.012681
bathrooms_number         1.741339  1        1.319598
bedrooms                 5.043515  1        2.245777
beds                     4.279216  1        2.068626
accommodates             6.182276  1        2.486418
host_is_superhost        1.183573  1        1.087921
instant_bookable         1.027699  1        1.013755
number_of_reviews        1.193796  1        1.092610
review_scores_rating     1.047558  1        1.023503
room_type                1.432329  3        1.061713
autoplot(model_availability)+theme_bw()

#This time host_is_superhost (p = 0.100) and beds (p = 0.187) are no longer significant, and number_of_reviews is only marginally significant (p = 0.046). The adjusted R-squared rises slightly, from 0.5661 to 0.5719, so we consider whether to drop these variables

model_availability<-lm(log_price_4_nights~availability_30+neighbourhood_simplified+bathrooms_number+bedrooms+accommodates+instant_bookable+review_scores_rating+room_type,data=listings_sydney_golden)
msummary(model_availability)
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    5.7057368  0.0421063 135.508  < 2e-16 ***
availability_30                0.0044816  0.0005718   7.838 5.69e-15 ***
neighbourhood_simplifiedEast   0.0008903  0.0188756   0.047   0.9624    
neighbourhood_simplifiedNorth  0.0520134  0.0204545   2.543   0.0110 *  
neighbourhood_simplifiedOther -0.3779571  0.0928021  -4.073 4.73e-05 ***
neighbourhood_simplifiedSouth -0.2566638  0.0369541  -6.945 4.31e-12 ***
neighbourhood_simplifiedWest  -0.2892724  0.0201795 -14.335  < 2e-16 ***
bathrooms_number               0.1415615  0.0162029   8.737  < 2e-16 ***
bedrooms                       0.2430378  0.0170733  14.235  < 2e-16 ***
accommodates                   0.0417181  0.0072657   5.742 9.99e-09 ***
instant_bookableTRUE          -0.0569918  0.0145020  -3.930 8.62e-05 ***
review_scores_rating           0.0180658  0.0073835   2.447   0.0145 *  
room_typeHotel room            0.1632153  0.0876349   1.862   0.0626 .  
room_typePrivate room         -0.5475177  0.0171508 -31.924  < 2e-16 ***
room_typeShared room          -0.8732446  0.0961574  -9.081  < 2e-16 ***

Residual standard error: 0.4659 on 4505 degrees of freedom
  (1591 observations deleted due to missingness)
Multiple R-squared:  0.5732,    Adjusted R-squared:  0.5719 
F-statistic: 432.2 on 14 and 4505 DF,  p-value: < 2.2e-16
car::vif(model_availability)
                             GVIF Df GVIF^(1/(2*Df))
availability_30          1.016872  1        1.008401
neighbourhood_simplified 1.099250  5        1.009508
bathrooms_number         1.734104  1        1.316854
bedrooms                 4.760599  1        2.181880
accommodates             4.286254  1        2.070327
instant_bookable         1.026831  1        1.013327
review_scores_rating     1.018722  1        1.009318
room_type                1.366330  3        1.053398
autoplot(model_availability)+theme_bw()
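Because the response is log_price_4_nights, the coefficients above read more naturally as percentage changes in price_4_nights, holding the other predictors fixed. A small worked sketch, with the two coefficients copied by hand from the summary above (the helper name pct_change is ours, not part of any package):

```r
# Coefficients copied from the model_availability summary above
# (response is log_price_4_nights).
coef_availability_30 <- 0.0044816
coef_private_room    <- -0.5475177

# In a log-linear model, a one-unit change in x multiplies the
# response by exp(beta), i.e. a change of 100 * (exp(beta) - 1) percent.
pct_change <- function(beta) 100 * (exp(beta) - 1)

pct_availability_30 <- pct_change(coef_availability_30)
pct_private_room    <- pct_change(coef_private_room)

round(c(availability_30 = pct_availability_30,
        private_room    = pct_private_room), 2)
```

Under this reading, each extra available day in the next 30 is associated with a price roughly 0.45% higher, and a private room with a price roughly 42% lower than the baseline room type.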

# In the next model, host_is_superhost is significant again (p = 0.004). reviews_per_month is significant at the 1% level (p = 0.006), but not at alpha = 0.001.
model_reviews<-lm(log_price_4_nights~reviews_per_month+neighbourhood_simplified+bathrooms_number+bedrooms+accommodates+host_is_superhost+instant_bookable+review_scores_rating+room_type,data=listings_sydney_golden)

msummary(model_reviews)
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    5.745472   0.042354 135.655  < 2e-16 ***
reviews_per_month             -0.013421   0.004883  -2.748  0.00601 ** 
neighbourhood_simplifiedEast  -0.003304   0.019039  -0.174  0.86225    
neighbourhood_simplifiedNorth  0.054003   0.020591   2.623  0.00875 ** 
neighbourhood_simplifiedOther -0.375201   0.093319  -4.021 5.90e-05 ***
neighbourhood_simplifiedSouth -0.257738   0.037357  -6.899 5.95e-12 ***
neighbourhood_simplifiedWest  -0.284144   0.020280 -14.011  < 2e-16 ***
bathrooms_number               0.139544   0.016320   8.550  < 2e-16 ***
bedrooms                       0.238226   0.017204  13.847  < 2e-16 ***
accommodates                   0.045393   0.007306   6.213 5.67e-10 ***
host_is_superhostTRUE          0.056827   0.019725   2.881  0.00398 ** 
instant_bookableTRUE          -0.057068   0.014610  -3.906 9.52e-05 ***
review_scores_rating           0.016498   0.007526   2.192  0.02843 *  
room_typeHotel room            0.227009   0.088049   2.578  0.00996 ** 
room_typePrivate room         -0.543864   0.017325 -31.392  < 2e-16 ***
room_typeShared room          -0.865254   0.096766  -8.942  < 2e-16 ***

Residual standard error: 0.4685 on 4504 degrees of freedom
  (1591 observations deleted due to missingness)
Multiple R-squared:  0.5686,    Adjusted R-squared:  0.5671 
F-statistic: 395.7 on 15 and 4504 DF,  p-value: < 2.2e-16
car::vif(model_reviews)
                             GVIF Df GVIF^(1/(2*Df))
reviews_per_month        1.138613  1        1.067058
neighbourhood_simplified 1.119083  5        1.011314
bathrooms_number         1.740093  1        1.319126
bedrooms                 4.781013  1        2.186553
accommodates             4.286485  1        2.070383
host_is_superhost        1.133700  1        1.064753
instant_bookable         1.030805  1        1.015286
review_scores_rating     1.046942  1        1.023202
room_type                1.380336  3        1.055190
# Then, as instructed, we run a model with all the other variables used before plus both reviews_per_month and availability_30. Both are significant, and host_is_superhost is significant again
model_reviews_availability<-lm(log_price_4_nights~availability_30+reviews_per_month+neighbourhood_simplified+bathrooms_number+bedrooms+accommodates+host_is_superhost+instant_bookable+review_scores_rating+room_type,data=listings_sydney_golden)
msummary(model_reviews_availability)
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    5.717299   0.042178 135.551  < 2e-16 ***
availability_30                0.004861   0.000588   8.267  < 2e-16 ***
reviews_per_month             -0.021086   0.004935  -4.273 1.97e-05 ***
neighbourhood_simplifiedEast  -0.004090   0.018898  -0.216 0.828680    
neighbourhood_simplifiedNorth  0.046369   0.020459   2.266 0.023473 *  
neighbourhood_simplifiedOther -0.382915   0.092634  -4.134 3.64e-05 ***
neighbourhood_simplifiedSouth -0.262488   0.037085  -7.078 1.69e-12 ***
neighbourhood_simplifiedWest  -0.292827   0.020158 -14.527  < 2e-16 ***
bathrooms_number               0.138071   0.016201   8.523  < 2e-16 ***
bedrooms                       0.240459   0.017079  14.079  < 2e-16 ***
accommodates                   0.042718   0.007259   5.885 4.28e-09 ***
host_is_superhostTRUE          0.040632   0.019677   2.065 0.038988 *  
instant_bookableTRUE          -0.052906   0.014511  -3.646 0.000269 ***
review_scores_rating           0.018940   0.007476   2.533 0.011335 *  
room_typeHotel room            0.181586   0.087570   2.074 0.038174 *  
room_typePrivate room         -0.555462   0.017254 -32.194  < 2e-16 ***
room_typeShared room          -0.880860   0.096069  -9.169  < 2e-16 ***

Residual standard error: 0.465 on 4503 degrees of freedom
  (1591 observations deleted due to missingness)
Multiple R-squared:  0.575, Adjusted R-squared:  0.5735 
F-statistic: 380.8 on 16 and 4503 DF,  p-value: < 2.2e-16
car::vif(model_reviews_availability)
                             GVIF Df GVIF^(1/(2*Df))
availability_30          1.079408  1        1.038945
reviews_per_month        1.180264  1        1.086399
neighbourhood_simplified 1.123912  5        1.011750
bathrooms_number         1.740303  1        1.319206
bedrooms                 4.782210  1        2.186826
accommodates             4.295017  1        2.072442
host_is_superhost        1.145048  1        1.070069
instant_bookable         1.032047  1        1.015897
review_scores_rating     1.048578  1        1.024001
room_type                1.394268  3        1.056958
autoplot(model_reviews_availability)+theme_bw()

#The anova checks whether adding reviews_per_month and host_is_superhost to model_availability improves the model. The F statistic is large enough and the p-value is smaller than 0.001, so these variables should be kept
anova(model_availability,model_reviews_availability)
    Res.Df RSS Df Sum of Sq    F   Pr(>F)
  4.50e+03 978
  4.5e+03  974  2      4.18 9.67 6.43e-05
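As a check on the anova output above, the F statistic can be recomputed by hand. The sketch below assumes the flattened second row reads as RSS = 974, Df = 2, Sum of Sq = 4.18, and takes the residual degrees of freedom (4503) from the model_reviews_availability summary. Because the RSS is rounded, the result only approximately matches the reported 9.67.

```r
# Values read from the anova() output and model summaries above.
rss_full    <- 974    # RSS of model_reviews_availability (rounded)
sum_of_sq   <- 4.18   # reduction in RSS from the added terms
df_added    <- 2      # reviews_per_month and host_is_superhost
df_residual <- 4503   # residual df of model_reviews_availability

# F = (drop in RSS per added parameter) / (residual variance of full model)
f_stat  <- (sum_of_sq / df_added) / (rss_full / df_residual)

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_stat, df_added, df_residual, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```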

2.2 Diagnostics, collinearity, summary tables

listings_sydney_golden1 <- listings_sydney_golden %>% 
  select(log_price_4_nights,bathrooms_number, distance_opera,host_since) %>% 
  ggpairs()

# First, we take all the numerical variables mentioned above and create a scatterplot and correlation matrix with ggpairs
listings_sydney_golden2 <- listings_sydney_golden %>% 
  select(log_price_4_nights, #only numerical variables are selected; categorical ones are left out for simplicity
             host_listings_count,
             accommodates,
             bathrooms_number,
             bedrooms,
             beds,
             availability_30, #only keeping availability in 30 days since the other availability windows add little value (tested using correlations)
             number_of_reviews,
             review_scores_rating,
             reviews_per_month,
             distance_opera) %>% 
  ggpairs(size=1)

listings_sydney_golden2

#In this part, we remove highly collinear variables (Pearson correlation >= 0.7). beds and bedrooms are highly correlated, and so are number_of_reviews and reviews_per_month
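The screening rule described above can also be applied programmatically instead of reading the pairs off the plot. A minimal sketch on R's built-in mtcars data (used here only as a stand-in for the listings; 0.7 is the cutoff we chose above):

```r
# Illustration on built-in data (mtcars), not on the Airbnb listings.
numeric_vars <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# Pairwise Pearson correlations; missing values are dropped pair by pair.
cor_matrix <- cor(numeric_vars, use = "pairwise.complete.obs",
                  method = "pearson")

# Keep each pair once (upper triangle) and flag |r| >= 0.7.
threshold <- 0.7
idx <- which(abs(cor_matrix) >= threshold & upper.tri(cor_matrix),
             arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)

# Strongest correlations first.
high_pairs[order(-abs(high_pairs$r)), ]
```

For each flagged pair, one variable would then be a candidate to drop from the model.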
  1. We created a summary table, using huxtable that shows which models we worked on, which predictors are significant, the adjusted \(R^2\), and the Residual Standard Error.
# produce summary table comparing models using huxtable::huxreg()
huxreg(model1, model2, model_4_variables,model_4_variables_final,model_superhost, model_instant_bookable,model_neighbourhood_cleansed,model_reviews,model_availability,model_reviews_availability,
       statistics = c('#observations' = 'nobs', 
                      'R squared' = 'r.squared', 
                      'Adj. R Squared' = 'adj.r.squared', 
                      'Residual SE' = 'sigma'), 
#       bold_signif = 0.05, 
       stars = NULL
) %>% 
  set_caption('Comparison of models')
Comparison of models
                                                      (1)      (2)      (3)      (4)      (5)      (6)      (7)      (8)      (9)      (10)
(Intercept)                                           6.277    6.302    5.385    5.679    5.650    5.685    5.738    5.745    5.706    5.717
                                                      (0.040)  (0.039)  (0.020)  (0.042)  (0.042)  (0.043)  (0.043)  (0.042)  (0.042)  (0.042)
prop_type_simplifiedEntire residential home           0.690    0.689             0.158
                                                      (0.026)  (0.026)           (0.028)
prop_type_simplifiedOther                             -0.165   0.064             -0.015
                                                      (0.023)  (0.027)           (0.026)
prop_type_simplifiedPrivate room in rental unit       -0.689   -0.076            -0.116
                                                      (0.023)  (0.048)           (0.045)
prop_type_simplifiedPrivate room in residential home  -0.761   -0.146            -0.257
                                                      (0.027)  (0.050)           (0.046)
number_of_reviews                                     -0.000   -0.001            -0.000   -0.000   -0.000   -0.000
                                                      (0.000)  (0.000)           (0.000)  (0.000)  (0.000)  (0.000)
review_scores_rating                                  0.026    0.022             0.023    0.023    0.021    0.015    0.016    0.018    0.019
                                                      (0.009)  (0.008)           (0.008)  (0.008)  (0.008)  (0.008)  (0.008)  (0.007)  (0.007)
room_typeHotel room                                            0.028             0.255    0.238    0.261    0.219    0.227    0.163    0.182
                                                               (0.084)           (0.093)  (0.091)  (0.091)  (0.088)  (0.088)  (0.088)  (0.088)
room_typePrivate room                                          -0.616            -0.427   -0.577   -0.576   -0.538   -0.544   -0.548   -0.555
                                                               (0.042)           (0.040)  (0.018)  (0.018)  (0.017)  (0.017)  (0.017)  (0.017)
room_typeShared room                                           -1.071            -0.894   -0.907   -0.918   -0.840   -0.865   -0.873   -0.881
                                                               (0.113)           (0.104)  (0.102)  (0.102)  (0.099)  (0.097)  (0.096)  (0.096)
bathrooms_number                                                        0.065    0.146    0.138    0.138    0.141    0.140    0.142    0.138
                                                                        (0.017)  (0.017)  (0.017)  (0.017)  (0.016)  (0.016)  (0.016)  (0.016)
bedrooms                                                                0.314    0.217    0.251    0.245    0.244    0.238    0.243    0.240
                                                                        (0.018)  (0.019)  (0.018)  (0.018)  (0.018)  (0.017)  (0.017)  (0.017)
beds                                                                    -0.028   -0.007   -0.005   -0.006   -0.012
                                                                        (0.011)  (0.011)  (0.011)  (0.011)  (0.010)
accommodates                                                            0.110    0.043    0.041    0.044    0.050    0.045    0.042    0.043
                                                                        (0.009)  (0.009)  (0.009)  (0.009)  (0.009)  (0.007)  (0.007)  (0.007)
host_is_superhostTRUE                                                                     0.029    0.030    0.049    0.057             0.041
                                                                                          (0.021)  (0.021)  (0.020)  (0.020)           (0.020)
instant_bookableTRUE                                                                               -0.062   -0.060   -0.057   -0.057   -0.053
                                                                                                   (0.015)  (0.015)  (0.015)  (0.015)  (0.015)
neighbourhood_simplifiedEast                                                                                -0.001   -0.003   0.001    -0.004
                                                                                                            (0.019)  (0.019)  (0.019)  (0.019)
neighbourhood_simplifiedNorth                                                                               0.057    0.054    0.052    0.046
                                                                                                            (0.021)  (0.021)  (0.020)  (0.020)
neighbourhood_simplifiedOther                                                                               -0.377   -0.375   -0.378   -0.383
                                                                                                            (0.094)  (0.093)  (0.093)  (0.093)
neighbourhood_simplifiedSouth                                                                               -0.253   -0.258   -0.257   -0.262
                                                                                                            (0.038)  (0.037)  (0.037)  (0.037)
neighbourhood_simplifiedWest                                                                                -0.284   -0.284   -0.289   -0.293
                                                                                                            (0.020)  (0.020)  (0.020)  (0.020)
reviews_per_month                                                                                                    -0.013            -0.021
                                                                                                                     (0.005)           (0.005)
availability_30                                                                                                               0.004    0.005
                                                                                                                              (0.001)  (0.001)
#observations                                         4946     4946     5567     4505     4505     4505     4505     4520     4520     4520
R squared                                             0.361    0.396    0.404    0.542    0.534    0.536    0.568    0.569    0.573    0.575
Adj. R Squared                                        0.360    0.394    0.404    0.541    0.533    0.534    0.566    0.567    0.572    0.574
Residual SE                                           0.553    0.538    0.569    0.483    0.487    0.486    0.469    0.468    0.466    0.465
#By adding more variables to the previous ones (distance_opera, license, etc.) and more categorical variables, we build a final model with the highest adjusted R-squared we could obtain, while ensuring that all variables are significant


#convert the rate columns to numbers and count the comma-separated items in amenities and host_verifications
listings_sydney_golden_final<-listings_sydney_golden %>%
  mutate(host_response_rate=parse_number(host_response_rate),
         host_acceptance_rate=parse_number(host_acceptance_rate),
         amenities_number=lengths(strsplit(amenities,",")), #number of amenities per listing
         amenities_words=amenities_number,
         host_verification_words=lengths(strsplit(host_verifications,","))) #number of verification methods per host
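Counting comma-separated items does not need a loop or a hard-coded row count, because strsplit() and lengths() are both vectorised. A small sketch on made-up strings in the style of the amenities column (the example values are ours, for illustration only):

```r
# Toy strings in the style of the amenities column (made up for illustration).
amenities_example <- c('["Wifi", "Kitchen", "Heating"]', '["Wifi"]', '[]')

# strsplit() returns a list with one character vector per element;
# lengths() then counts the pieces of each vector.
n_items <- lengths(strsplit(amenities_example, ","))
n_items

# Caveat: an empty list "[]" contains no comma, so it still counts as 1.
# One possible correction (an assumption about the desired behaviour):
n_items_fixed <- ifelse(amenities_example == "[]", 0L, n_items)
n_items_fixed
```

The same caveat applies to host_verifications: a count of 1 can mean either one item or an empty list.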

#best model we have so far
model_final<-lm(log_price_4_nights~number_of_reviews+bathrooms_number+bedrooms+review_scores_rating+room_type+host_response_rate+host_identity_verified+availability_90+review_scores_communication+review_scores_value+latitude+longitude+host_verification_words,data=listings_sydney_golden_final) 
msummary(model_final)
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 -2.668e+02  2.097e+01 -12.724  < 2e-16 ***
number_of_reviews           -6.832e-04  1.633e-04  -4.184 3.04e-05 ***
bathrooms_number             1.780e-01  2.486e-02   7.159 1.29e-12 ***
bedrooms                     2.707e-01  1.709e-02  15.843  < 2e-16 ***
review_scores_rating         4.607e-01  5.450e-02   8.453  < 2e-16 ***
room_typeHotel room          2.196e-01  9.054e-02   2.425  0.01542 *  
room_typePrivate room       -6.399e-01  3.000e-02 -21.329  < 2e-16 ***
room_typeShared room        -1.509e+00  3.120e-01  -4.837 1.46e-06 ***
host_response_rate          -2.296e-03  5.842e-04  -3.930 8.89e-05 ***
host_identity_verifiedTRUE  -1.007e-01  4.538e-02  -2.218  0.02670 *  
availability_90             -6.670e-04  3.126e-04  -2.134  0.03305 *  
review_scores_communication -1.311e-01  4.516e-02  -2.904  0.00374 ** 
review_scores_value         -3.247e-01  4.907e-02  -6.616 5.17e-11 ***
latitude                     8.172e-01  1.372e-01   5.956 3.23e-09 ***
longitude                    1.989e+00  1.311e-01  15.175  < 2e-16 ***
host_verification_words     -1.334e-02  6.282e-03  -2.124  0.03388 *  

Residual standard error: 0.4332 on 1446 degrees of freedom
  (4649 observations deleted due to missingness)
Multiple R-squared:  0.675, Adjusted R-squared:  0.6716 
F-statistic: 200.2 on 15 and 1446 DF,  p-value: < 2.2e-16
car::vif(model_final)
                                GVIF Df GVIF^(1/(2*Df))
number_of_reviews           1.098407  1        1.048049
bathrooms_number            1.991934  1        1.411359
bedrooms                    2.261568  1        1.503851
review_scores_rating        4.146465  1        2.036287
room_type                   1.295544  3        1.044100
host_response_rate          1.105556  1        1.051454
host_identity_verified      1.208941  1        1.099518
availability_90             1.050237  1        1.024811
review_scores_communication 2.069786  1        1.438675
review_scores_value         3.461040  1        1.860387
latitude                    1.093461  1        1.045687
longitude                   1.100237  1        1.048922
host_verification_words     1.253325  1        1.119520
#only contains numeric variables
listings_sydney_golden_final_corr<-listings_sydney_golden_final%>%
  select(log_price_4_nights,
    number_of_reviews,
         bathrooms_number,
         bedrooms,
         review_scores_rating,
         host_response_rate,
         availability_90,
         review_scores_communication,
         review_scores_value,
         latitude,
         longitude,
         host_verification_words)

cor_listings_sydney<- cor(listings_sydney_golden_final_corr, method="pearson")
cor_listings_sydney
                            log_price_4_nights number_of_reviews
log_price_4_nights                 1.000000000      -0.007192563
number_of_reviews                 -0.007192563       1.000000000
bathrooms_number                            NA                NA
bedrooms                                    NA                NA
review_scores_rating                        NA                NA
host_response_rate                          NA                NA
availability_90                    0.078100956       0.182522341
review_scores_communication                 NA                NA
review_scores_value                         NA                NA
latitude                           0.213186779       0.017569060
longitude                          0.265463676      -0.011502634
host_verification_words            0.038405025       0.133125401
                            bathrooms_number bedrooms review_scores_rating
log_price_4_nights                        NA       NA                   NA
number_of_reviews                         NA       NA                   NA
bathrooms_number                           1       NA                   NA
bedrooms                                  NA        1                   NA
review_scores_rating                      NA       NA                    1
host_response_rate                        NA       NA                   NA
availability_90                           NA       NA                   NA
review_scores_communication               NA       NA                   NA
review_scores_value                       NA       NA                   NA
latitude                                  NA       NA                   NA
longitude                                 NA       NA                   NA
host_verification_words                   NA       NA                   NA
                            host_response_rate availability_90
log_price_4_nights                          NA      0.07810096
number_of_reviews                           NA      0.18252234
bathrooms_number                            NA              NA
bedrooms                                    NA              NA
review_scores_rating                        NA              NA
host_response_rate                           1              NA
availability_90                             NA      1.00000000
review_scores_communication                 NA              NA
review_scores_value                         NA              NA
latitude                                    NA      0.08897066
longitude                                   NA     -0.13253443
host_verification_words                     NA     -0.06565404
                            review_scores_communication review_scores_value
log_price_4_nights                                   NA                  NA
number_of_reviews                                    NA                  NA
bathrooms_number                                     NA                  NA
bedrooms                                             NA                  NA
review_scores_rating                                 NA                  NA
host_response_rate                                   NA                  NA
availability_90                                      NA                  NA
review_scores_communication                           1                  NA
review_scores_value                                  NA                   1
latitude                                             NA                  NA
longitude                                            NA                  NA
host_verification_words                              NA                  NA
                              latitude   longitude host_verification_words
log_price_4_nights          0.21318678  0.26546368              0.03840502
number_of_reviews           0.01756906 -0.01150263              0.13312540
bathrooms_number                    NA          NA                      NA
bedrooms                            NA          NA                      NA
review_scores_rating                NA          NA                      NA
host_response_rate                  NA          NA                      NA
availability_90             0.08897066 -0.13253443             -0.06565404
review_scores_communication         NA          NA                      NA
review_scores_value                 NA          NA                      NA
latitude                    1.00000000  0.08090837              0.01474355
longitude                   0.08090837  1.00000000              0.10736374
host_verification_words     0.01474355  0.10736374              1.00000000
corrplot(cor_listings_sydney,method="color",type="lower",tl.cex=1)
corrplot(cor_listings_sydney,method="pie",type="upper",add=TRUE,tl.cex=1,cl.cex=0.5)
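Many cells in the correlation matrix above are NA because cor() by default returns NA for any pair that involves a column with missing values. A minimal sketch on made-up data showing the two usual alternatives (the toy data frame is an assumption for illustration):

```r
# Toy data with missing values (made up for illustration).
toy <- data.frame(
  x = c(1, 2, 3, 4, NA),
  y = c(2, 4, 6, NA, 10),
  z = c(5, 3, 4, 1, 2)
)

# Default use = "everything": any pair involving a column with NA gives NA.
cor_default <- cor(toy)

# Pairwise deletion: each correlation uses the rows complete for that pair.
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

# Listwise deletion: only rows complete in every column are used.
cor_complete <- cor(toy, use = "complete.obs")

cor_default
cor_pairwise
```

Listwise deletion matches what lm() does when it drops observations due to missingness, so it is probably the closer companion to the regression output; pairwise deletion keeps more data per cell.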

#Comparing most of the models created; a couple with lower R-squared are removed, as only 9 can be displayed
huxreg(model_superhost, model_instant_bookable,model_neighbourhood_cleansed,model_neighbourhood_cleansed2,model_reviews,model_availability,model_reviews_availability,model_final,
       statistics = c('#observations' = 'nobs', 
                      'R squared' = 'r.squared', 
                      'Adj. R Squared' = 'adj.r.squared', 
                      'Residual SE' = 'sigma'), 
#       bold_signif = 0.05, 
       stars = NULL
) %>% 
  set_caption('Comparison of all models')
Comparison of all models
                               (1)      (2)      (3)      (4)      (5)      (6)      (7)      (8)
(Intercept)                    5.650    5.685    5.738    5.738    5.745    5.706    5.717    -266.781
                               (0.042)  (0.043)  (0.043)  (0.042)  (0.042)  (0.042)  (0.042)  (20.967)
bathrooms_number               0.138    0.138    0.141    0.141    0.140    0.142    0.138    0.178
                               (0.017)  (0.017)  (0.016)  (0.016)  (0.016)  (0.016)  (0.016)  (0.025)
bedrooms                       0.251    0.245    0.244    0.241    0.238    0.243    0.240    0.271
                               (0.018)  (0.018)  (0.018)  (0.017)  (0.017)  (0.017)  (0.017)  (0.017)
beds                           -0.005   -0.006   -0.012
                               (0.011)  (0.011)  (0.010)
accommodates                   0.041    0.044    0.050    0.044    0.045    0.042    0.043
                               (0.009)  (0.009)  (0.009)  (0.007)  (0.007)  (0.007)  (0.007)
host_is_superhostTRUE          0.029    0.030    0.049    0.042    0.057             0.041
                               (0.021)  (0.021)  (0.020)  (0.019)  (0.020)           (0.020)
number_of_reviews              -0.000   -0.000   -0.000                                       -0.001
                               (0.000)  (0.000)  (0.000)                                      (0.000)
review_scores_rating           0.023    0.021    0.015    0.015    0.016    0.018    0.019    0.461
                               (0.008)  (0.008)  (0.008)  (0.008)  (0.008)  (0.007)  (0.007)  (0.055)
room_typeHotel room            0.238    0.261    0.219    0.213    0.227    0.163    0.182    0.220
                               (0.091)  (0.091)  (0.088)  (0.088)  (0.088)  (0.088)  (0.088)  (0.091)
room_typePrivate room          -0.577   -0.576   -0.538   -0.539   -0.544   -0.548   -0.555   -0.640
                               (0.018)  (0.018)  (0.017)  (0.017)  (0.017)  (0.017)  (0.017)  (0.030)
room_typeShared room           -0.907   -0.918   -0.840   -0.859   -0.865   -0.873   -0.881   -1.509
                               (0.102)  (0.102)  (0.099)  (0.097)  (0.097)  (0.096)  (0.096)  (0.312)
instant_bookableTRUE                    -0.062   -0.060   -0.060   -0.057   -0.057   -0.053
                                        (0.015)  (0.015)  (0.015)  (0.015)  (0.015)  (0.015)
neighbourhood_simplifiedEast                     -0.001   0.000    -0.003   0.001    -0.004
                                                 (0.019)  (0.019)  (0.019)  (0.019)  (0.019)
neighbourhood_simplifiedNorth                    0.057    0.057    0.054    0.052    0.046
                                                 (0.021)  (0.021)  (0.021)  (0.020)  (0.020)
neighbourhood_simplifiedOther                    -0.377   -0.372   -0.375   -0.378   -0.383
                                                 (0.094)  (0.093)  (0.093)  (0.093)  (0.093)
neighbourhood_simplifiedSouth                    -0.253   -0.257   -0.258   -0.257   -0.262
                                                 (0.038)  (0.037)  (0.037)  (0.037)  (0.037)
neighbourhood_simplifiedWest                     -0.284   -0.283   -0.284   -0.289   -0.293
                                                 (0.020)  (0.020)  (0.020)  (0.020)  (0.020)
reviews_per_month                                                  -0.013            -0.021
                                                                   (0.005)           (0.005)
availability_30                                                             0.004    0.005
                                                                            (0.001)  (0.001)
host_response_rate                                                                            -0.002
                                                                                              (0.001)
host_identity_verifiedTRUE                                                                    -0.101
                                                                                              (0.045)
availability_90                                                                               -0.001
                                                                                              (0.000)
review_scores_communication                                                                   -0.131
                                                                                              (0.045)
review_scores_value                                                                           -0.325
                                                                                              (0.049)
latitude                                                                                      0.817
                                                                                              (0.137)
longitude                                                                                     1.989
                                                                                              (0.131)
host_verification_words                                                                       -0.013
                                                                                              (0.006)
#observations                  4505     4505     4505     4520     4520     4520     4520     1462
R squared                      0.534    0.536    0.568    0.568    0.569    0.573    0.575    0.675
Adj. R Squared                 0.533    0.534    0.566    0.567    0.567    0.572    0.574    0.672
Residual SE                    0.487    0.486    0.469    0.469    0.468    0.466    0.465    0.433
  1. Finally, we use the best model we came up with for prediction. Suppose we are planning to visit Sydney over reading week and want to stay in an Airbnb. We look for Airbnbs in the destination city that are apartments with a private room, have at least 10 reviews, and have an average rating of at least 90. We then use our best model to predict the total cost of staying at such an Airbnb for 4 nights, and we report the appropriate 95% interval with the prediction.

Report the point prediction and interval in terms of price_4_nights. If you used a log_price_4_nights model, make sure you anti-log to convert the value to $. You can read more about how to interpret a regression model when some variables are log transformed here.

#predict the total cost to stay at this Airbnb for 4 nights. Include the appropriate 95% interval with your prediction

# Assumption: we treat "an average rating of at least 90" as host_response_rate >= 90
imaginary_sydney_visit <- listings_sydney_golden_final %>%
  select(price_4_nights,
         log_price_4_nights,
         number_of_reviews,
         bathrooms_number,
         bedrooms,
         review_scores_rating,
         room_type,
         host_response_rate,
         host_identity_verified,
         availability_90,
         review_scores_communication,
         review_scores_value,
         latitude,
         longitude,
         host_verification_words) %>%
  drop_na() %>%
  filter(number_of_reviews >= 10,
         room_type == "Private room",
         host_response_rate >= 90)

# model_final predicts log_price_4_nights, so we anti-log with exp() to get values in $
predict_price <- exp(predict(model_final, newdata = imaginary_sydney_visit, interval = "prediction", level = 0.95))
predict_price
         fit       lwr       upr
1   449.3834 190.95354 1057.5630
2   238.3850 101.58326  559.4169
3   218.7959  93.15520  513.8913
4   379.6554 161.86760  890.4698
5   207.5001  88.44138  486.8342
6   309.3198 131.82413  725.8062
7   278.5258 118.65809  653.7827
8   364.3536 155.03093  856.3038
9   371.9063 158.23457  874.1092
10  290.6832 123.95277  681.6849
11  260.4699 111.05217  610.9252
12  291.1224 124.12306  682.8083
13  322.1001 137.41445  755.0040
14  317.7818 135.45651  745.5179
15  346.0085 147.52747  811.5225
16  265.9014 113.26181  624.2488
17  320.1537 136.25065  752.2780
18  345.1078 146.70395  811.8347
19  310.1358 132.30000  727.0159
20  273.5197 116.75638  640.7615
21  323.6611 137.73972  760.5397
22  256.5209 109.39937  601.4934
23  248.9907 105.82235  585.8531
24  282.1835 120.42517  661.2197
25  281.5734 120.07364  660.2915
26  303.9777 129.30274  714.6210
27  324.9024 138.26833  763.4546
28  323.5216 137.83436  759.3626
29  367.4907 156.64269  862.1497
30  316.5340 134.85651  742.9657
31  493.7067 209.91034 1161.1924
32  310.9323 132.66269  728.7573
33  295.5593 126.10673  692.7093
34  285.4599 121.48080  670.7839
35  258.2559 110.20906  605.1782
36  246.1501 104.95303  577.3048
37  270.3985 115.29229  634.1739
38  313.6007 133.74888  735.2989
39  368.4495 157.00980  864.6278
40  301.8785 128.80125  707.5289
41  263.7587 112.45140  618.6550
42  306.6588 130.49081  720.6609
43  276.6102 118.01031  648.3605
44  261.3231 111.45434  612.7153
45  289.2663 123.11410  679.6538
46  265.3857 113.07266  622.8702
47  393.2855 167.41347  923.9011
48  238.5348 101.66417  559.6744
49  280.1031 119.20356  658.1828
50  233.8934  99.80827  548.1123
51  337.6913 144.06086  791.5782
52  379.7736 161.57230  892.6529
53  332.8633 141.88578  780.8954
54  214.4016  91.40706  502.8937
55  364.3231 155.25666  854.9157
56  255.1349 108.58705  599.4620
57  293.5014 125.07396  688.7372
58  285.8919 121.98991  670.0076
59  276.6956 118.04502  648.5701
60  271.7879 115.91987  637.2389
61  365.5556 155.94996  856.8831
62  369.0501 156.94116  867.8284
63  211.4666  89.64828  498.8173
64  247.1127 105.41422  579.2833
65  304.9493 130.07438  714.9299
66  260.6110 111.19447  610.8046
67  291.6524 124.38286  683.8653
68  255.8654 109.14442  599.8207
69  242.9699 103.66626  569.4656
70  274.4692 117.12659  643.1789
71  613.9757 261.53085 1441.3833
72  272.5264 116.27313  638.7600
73  307.3064 131.02100  720.7792
74  291.9487 124.49596  684.6332
75  360.9555 153.86079  846.7970
76  290.7470 123.99065  681.7758
77  266.1269 113.46934  624.1643
78  283.0621 120.80257  663.2651
79  299.7111 127.89610  702.3417
80  262.3348 111.80421  615.5362
81  255.5116 108.95161  599.2216
82  369.0314 157.10246  866.8495
83  289.6309 123.59415  678.7219
84  242.6256 103.41111  569.2541
85  239.8123 102.19923  562.7238
86  172.8064  73.59378  405.7687
87  243.8222 103.94970  571.9043
88  256.7148 109.54345  601.6105
89  364.3741 154.96059  856.7888
90  309.2369 131.62018  726.5409
91  328.5795 139.82333  772.1491
92  359.7676 152.89654  846.5380
93  462.3034 196.82991 1085.8329
94  320.7557 136.62898  753.0188
95  264.3460 112.60911  620.5430
96  233.0864  99.39541  546.5972
97  266.3452 113.51656  624.9286
98  318.3004 135.67785  746.7330
99  242.1218 103.27957  567.6143
100 279.8776 119.42918  655.8822
101 267.1034 113.88209  626.4744
102 503.5697 213.53893 1187.5234
103 250.4854 106.78958  587.5382
104 346.4389 147.67916  812.7072
105 257.3493 109.77931  603.2889
106 495.3328 209.96929 1168.5259
107 287.3371 122.09211  676.2322
108 298.5832 127.32552  700.1892
109 191.6047  81.35340  451.2703
110 241.8640 102.87311  568.6443
111 241.7849 102.88199  568.2231
112 261.4647 111.36512  613.8708
113 227.8881  97.07188  534.9952
114 355.2555 151.55822  832.7261
115 292.2484 124.58191  685.5662
116 245.8938 104.88233  576.4912
117 196.6511  83.51397  463.0562
118 311.9531 132.46535  734.6431
119 251.5652 107.30264  589.7812
120 243.7719 103.94137  571.7142
121 383.8995 163.57287  900.9980
122 294.4400 125.54833  690.5301
123 394.3184 168.01513  925.4345
124 321.4803 137.02995  754.2118
125 321.0807 136.72878  753.9952
126 235.7241 100.01742  555.5616
127 212.1349  90.51346  497.1769
128 293.9615 125.36839  689.2753
129 209.9588  89.09893  494.7613
130 270.1487 115.14268  633.8252
131 177.1573  75.52670  415.5445
132 263.5663 112.07317  619.8380
133 342.0950 145.93447  801.9283
134 521.3894 221.94142 1224.8590
135 266.7120 113.79787  625.1020
136 270.8087 115.49463  634.9851
137 280.6930 119.69816  658.2268
138 325.8106 138.84693  764.5294
139 336.9293 143.24127  792.5183
140 343.1346 146.13242  805.7169
141 269.5391 114.99516  631.7773
142 297.7647 127.00936  698.0889
143 264.4761 112.89262  619.5943
144 261.7371 111.65848  613.5345
145 152.9919  65.06404  359.7460
146 152.1226  64.39401  359.3702
147 375.3403 159.89698  881.0695
148 362.8704 154.61717  851.6191
149 197.6487  84.30548  463.3743
150 264.1782 112.38599  620.9859
151 300.6597 127.91582  706.6855
152 216.3032  92.23330  507.2686
153 293.0957 125.07614  686.8221
154 387.6676 164.66847  912.6590
155 276.2001 117.85004  647.3185
156 122.3860  51.74257  289.4780
157 139.3276  59.29910  327.3605
158 333.1031 142.07087  781.0025
159 197.1757  83.59241  465.0930
160 114.4187  48.38859  270.5523
161 291.6972 124.41935  683.8746
162 234.0026  99.67380  549.3642
163 383.5588 162.81745  903.5722
164 235.9550 100.50462  553.9524
165 254.1709 108.16630  597.2550
166 232.4590  98.99250  545.8715
167 226.6353  96.28273  533.4658
168 327.9480 139.46563  771.1571
169 249.4668 106.29998  585.4535
170 246.9635 105.31292  579.1405
171 235.9865 100.57067  553.7360
172 359.1531 152.97537  843.2137
173 375.4582 159.92532  881.4668
174 310.3765 132.22937  728.5337
175 237.1377 100.91329  557.2536
176 202.3028  86.26116  474.4477
177 240.0079 102.34822  562.8218
178 288.5397 123.00801  676.8273
179 239.2047 101.97055  561.1316
180 227.9578  97.21187  534.5517
181 248.8424 105.97315  584.3229
182 247.5387 105.39952  581.3632
183 177.4163  75.49162  416.9543
184 211.5530  90.09725  496.7374
185 215.8600  91.93704  506.8199
186 264.7509 112.98220  620.3900
187 141.3648  60.11456  332.4321
188 284.8612 121.42113  668.3016
189 199.5481  85.08138  468.0159
# lwr and upr are the bounds of the 95% prediction interval for each listing (interval="prediction"); this is not a confidence interval
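To illustrate the difference between the two kinds of interval, here is a minimal sketch on R's built-in mtcars data rather than on the Airbnb listings. The model, the object names (fit, new_cars, conf_int, pred_int) and the new data values are assumptions made for the example only. As in our own code, the model is fitted on a log response and the intervals are anti-logged with exp().

```r
# Illustration on built-in data (mtcars), not on the Airbnb listings
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Hypothetical new observations to predict for
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))

# Both intervals are computed on the log scale and then anti-logged
conf_int <- exp(predict(fit, newdata = new_cars, interval = "confidence", level = 0.95))
pred_int <- exp(predict(fit, newdata = new_cars, interval = "prediction", level = 0.95))

# Width of each interval: the prediction interval is the wider of the two
cbind(conf_width = conf_int[, "upr"] - conf_int[, "lwr"],
      pred_width = pred_int[, "upr"] - pred_int[, "lwr"])
```

The confidence interval describes the uncertainty in the average response for listings with these characteristics, while the prediction interval also includes the scatter of individual observations around that average, which is why it is wider.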

3 Deliverables

  • By midnight on Monday 17 Oct 2022, you must upload on Canvas a short presentation (max 4-5 slides) with your findings, as some groups will be asked to present in class. You should present your Exploratory Data Analysis, as well as your best model. In addition, you must upload on Canvas your final report, written using R Markdown to introduce, frame, and describe your story and findings. You should include the following in the memo:
  1. Executive Summary: Based on your best model, indicate the factors that influence price_4_nights. This should be written for an intelligent but non-technical audience. All other sections can include technical writing.

Looking at our best model, the following factors stand out as having a substantial influence on price_4_nights. Each has a coefficient of magnitude greater than 0.1, so a change in any of these variables changes the predicted price by a fair amount:

  • bathrooms_number
  • bedrooms
  • review_scores_rating
  • All categories in room_type
  • host_identity_verified
  • review_scores_communication
  • review_scores_value
  • latitude
  • longitude

From the above we can identify the qualitative factors one would expect to matter in the Sydney Airbnb market: location, number of bedrooms, number of bathrooms and room type all influence the cost of a stay. The rating of a listing matters too, most likely because nicer rooms are both better rated and more expensive. Whether the host's identity is verified also influences price. We expected verified hosts to charge more, on the view that hosts who care enough to get verified treat their property as a business venture and keep nicer properties. However, the estimated coefficient in our best model is negative (-0.101), so verified hosts are associated with somewhat lower prices once the other variables are held fixed.
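Because the best model is fitted on log_price_4_nights, a coefficient b corresponds to a multiplicative change of exp(b) in price_4_nights, which is a percentage change of (exp(b) - 1) * 100. The sketch below applies this to a few coefficients copied from the last column of the regression table, on the assumption that this column is model_final. The object names coefs and pct_change are introduced only for this example.

```r
# Coefficients copied from the last column of the regression table
# (assumed to be model_final, fitted on log_price_4_nights)
coefs <- c(
  "room_typeHotel room"        =  0.220,
  "room_typePrivate room"      = -0.640,
  "room_typeShared room"       = -1.509,
  "host_identity_verifiedTRUE" = -0.101,
  "review_scores_value"        = -0.325
)

# Approximate percentage change in price_4_nights, other variables held fixed
pct_change <- (exp(coefs) - 1) * 100
round(pct_change, 1)
```

For the room_type and host_identity_verified terms, the percentage is relative to the baseline category; for review_scores_value it is the change per one-unit increase in the score.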

  1. Data Exploration and Feature Selection: Present key elements of the data, including tables and graphs that help the reader understand the important variables in the dataset. Describe how the data was cleaned and prepared, including feature selection, transformations, interactions, and other approaches you considered.

Starting from the initial dataset, we tried to eliminate all repetitive or less significant variables. We first examined the data with skim and glimpse to understand its overall structure, and then analysed specific variables with favstats to check the soundness of the variables we had selected. After that, we plotted several variables with ggplot to visualise the data. We then dropped the variables that were not significant and isolated the numerical variables in order to examine their correlations with corrplot and ggpairs. Except between variables of the same kind (the different types of review scores, for instance), the correlation coefficients are quite low, which indicates that many of the variables are not linearly related.

Graphs can be found in the EDA section.

  1. Model Selection and Validation: Describe the model fitting and validation process used. State the model you selected and why they are preferable to other choices.

We added progressively more variables to the models before arriving at model_final. This model stood out because of its high R-squared and the significance of all its numerical variables. We used the msummary function to examine this model and compared it with our previously fitted models to check that it was indeed the best. The neighbourhood_simplified factor was removed from the final model because its variance inflation factor was above the threshold we accept (VIF > 5).
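For reference, the variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it in base R on the built-in mtcars data, not on the Airbnb listings. The helper vif_manual is a hypothetical function written for this example and only handles numeric predictors; for a factor such as neighbourhood_simplified a package function (for instance vif from the car package, which we assume is what a check like this would normally use) reports a generalised VIF instead.

```r
# Hypothetical helper: VIF of each column in a data frame of numeric predictors
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    # Regress predictor v on all the other predictors
    r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
    1 / (1 - r2)
  })
}

# Illustration on built-in data (mtcars)
vifs <- vif_manual(mtcars[, c("wt", "hp", "disp", "drat")])
vifs

# Predictors that exceed the VIF > 5 threshold used in this report
names(vifs)[vifs > 5]
```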

  1. Findings and Recommendations: Interpret the results of the selected model and discuss additional steps that might improve the analysis

In question 1, we discussed which variables had the largest influence on price. We were able to build a model that explains over two thirds of the variation in price (R-squared of 0.675). When fitting our model, our aim was to find every factor that had an effect on price and to explain as much of the variation in price as possible. As a result, our final model was fitted on fewer data points than we started with (1,462 observations, compared with 4,505 to 4,520 for the earlier models), because rows with missing values in any of its variables were dropped. This also means that its R-squared is not strictly comparable with those of the earlier models. The model remained statistically significant with the data points we had, but we could have improved our analysis by keeping more of the data, even if that reduced the share of price variation explained slightly.
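One way to see how much data a set of predictors costs is to count missing values per column and compare the number of complete rows for different variable subsets. The following is a minimal sketch on R's built-in airquality data, not on the Airbnb listings; the object names and the chosen subset of columns are assumptions made for the example.

```r
# Illustration on built-in data (airquality), which contains missing values
missing_per_column <- colSums(is.na(airquality))
sort(missing_per_column, decreasing = TRUE)

# Rows that would survive dropping NAs on all columns versus on a subset
n_total <- nrow(airquality)
n_complete_all <- sum(complete.cases(airquality))
n_complete_subset <- sum(complete.cases(airquality[, c("Wind", "Temp", "Solar.R")]))

c(total = n_total, all_columns = n_complete_all, subset = n_complete_subset)
```

Running the same counts on the listings before adding a variable to the model would show how many observations that variable removes.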

4 Assessment Rubric

5 Acknowledgements